Abstract: We consider the problem of predicting as well as the best linear combination of $d$ given functions in least squares regression under $L^\infty$ constraints on the linear combination. When the input distribution is known, there already exists an algorithm having an expected excess risk of order $d/n$, where $n$ is the size of the training data. Without this strong assumption, standard results often contain a multiplicative $\log n$ factor, complex constants involving the conditioning of the Gram matrix of the covariates, kurtosis coefficients or some geometric quantity characterizing the relation between $L^2$ and $L^\infty$-balls, and require some additional assumptions like exponential moments of the output.

This work provides a PAC-Bayesian shrinkage procedure with a simple excess risk 
bound of order d/n holding in expectation and in deviations, under various assumptions. 
The common surprising factor of these results is their simplicity and the absence of ex- 
ponential moment condition on the output distribution while achieving exponential de- 
viations. The risk bounds are obtained through a PAC-Bayesian analysis on truncated 
differences of losses. We also show that these results can be generalized to other strongly 
convex loss functions.

Introduction



Our statistical task. Let $Z_1 = (X_1, Y_1), \dots, Z_n = (X_n, Y_n)$ be $n \ge 2$ pairs of input-output and assume that each pair has been independently drawn from the same unknown distribution $P$. Let $\mathcal{X}$ denote the input space and let the output space be the set of real numbers $\mathbb{R}$, so that $P$ is a probability distribution on the product space $\mathcal{Z} = \mathcal{X} \times \mathbb{R}$. The target of learning algorithms is to predict the output $Y$ associated with an input $X$ for pairs $Z = (X, Y)$ drawn from the distribution $P$. The quality of a (prediction) function $f : \mathcal{X} \to \mathbb{R}$ is measured by the least squares risk:

$$R(f) = \mathbb{E}_{Z \sim P}\{[Y - f(X)]^2\}.$$

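Throughout the paper, we assume that the output and all the prediction functions we consider are square integrable. Let $\Theta$ be a closed convex set of $\mathbb{R}^d$, and $\varphi_1, \dots, \varphi_d$ be $d$ prediction functions. Consider the regression model

$$\mathcal{F} = \Big\{ f_\theta = \sum_{j=1}^d \theta_j \varphi_j \,;\, (\theta_1, \dots, \theta_d) \in \Theta \Big\}.$$

The best function $f^*$ in $\mathcal{F}$ is defined by

$$f^* = \sum_{j=1}^d \theta_j^* \varphi_j \in \operatorname*{argmin}_{f \in \mathcal{F}} R(f). \tag{0.1}$$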

Such a function always exists but is not necessarily unique. Besides it is unknown 
since the probability generating the data is unknown. 

We will study the problem of predicting (at least) as well as function $f^*$. In other words, we want to deduce from the observations $Z_1, \dots, Z_n$ a function $\hat f$ having with high probability a risk bounded by the minimal risk $R(f^*)$ on $\mathcal{F}$ plus a small remainder term, which is typically of order $d/n$. Except in particular settings (e.g., when $\Theta$ is a probability simplex^1 and $d > \sqrt{n}$), it is known that the convergence rate $d/n$ cannot be improved in a minimax sense (see [25], and [27] for related results).

More formally, the target of the paper is to develop estimators $\hat f$ for which the excess risk is controlled in deviations, i.e., such that for an appropriate constant $\kappa > 0$, for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$,

$$R(\hat f) - R(f^*) \le \kappa \, \frac{d + \log(\varepsilon^{-1})}{n}. \tag{0.2}$$



^1 This corresponds to the convex aggregation problem, which has been widely studied by several authors since the work of Nemirovski and Juditsky [22, 18]. This particular setting is not the topic of this paper, but our results apply to it, and correspond to the minimax optimal rate for $d < \sqrt{n}$. For $d > \sqrt{n}$, the minimax optimal rate of convex aggregation is $\sqrt{\log(1 + d/\sqrt{n})/n}$, which is not achieved by our procedure.

Note that by integrating the deviations (using the identity $\mathbb{E} W = \int_0^{+\infty} \mathbb{P}(W > t)\,dt$, which holds true for any nonnegative random variable $W$), Inequality (0.2) implies

$$\mathbb{E} R(\hat f) - R(f^*) \le \kappa' \, \frac{d}{n}. \tag{0.3}$$
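The integration step can be sketched as follows. This is only an illustration: it assumes that (0.2) reads $R(\hat f) - R(f^*) \le \kappa (d + \log(\varepsilon^{-1}))/n$ and that $\hat f$ takes its values in $\mathcal{F}$, so that the excess risk is nonnegative. The symbols $W$ and $t_0$ are introduced here for convenience and are not part of the original text.

```latex
\begin{align*}
W &= R(\hat f) - R(f^*) \ge 0, \qquad t_0 = \frac{\kappa d}{n}, \\
\mathbb{P}(W > t) &\le \exp\Big(d - \frac{n t}{\kappa}\Big)
  \quad \text{for } t \ge t_0
  \quad \text{(apply (0.2) with } \varepsilon = e^{d - nt/\kappa} \le 1\text{)}, \\
\mathbb{E} W &= \int_0^{+\infty} \mathbb{P}(W > t)\,dt
  \le t_0 + \int_{t_0}^{+\infty} \exp\Big(d - \frac{n t}{\kappa}\Big)\,dt
  = \frac{\kappa d}{n} + \frac{\kappa}{n}
  \le \frac{2 \kappa d}{n}.
\end{align*}
```

Under these assumptions, (0.3) holds for instance with $\kappa' = 2\kappa$, since $d \ge 1$.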

In this work, we do not assume that the function

$$f^{(\mathrm{reg})} : x \mapsto \mathbb{E}[Y \mid X = x],$$

which minimizes the risk $R$ among all possible measurable functions, belongs to the model $\mathcal{F}$. So we might have $f^* \ne f^{(\mathrm{reg})}$ and in this case, bounds of the form

$$\mathbb{E} R(\hat f) - R(f^{(\mathrm{reg})}) \le C \big[ R(f^*) - R(f^{(\mathrm{reg})}) \big] + \kappa \, \frac{d}{n}, \tag{0.4}$$

with a constant $C$ larger than 1 do not even ensure that $\mathbb{E} R(\hat f)$ tends to $R(f^*)$ when $n$ goes to infinity. Bounds of this kind with $C > 1$ have been developed to analyze nonparametric estimators using linear approximation spaces, in which case the dimension $d$ is a function of $n$ chosen so that the bias term $R(f^*) - R(f^{(\mathrm{reg})})$ has the order $d/n$ of the estimation term (see [16] and references within). Here we intend to assess the generalization ability of the estimator even when the model is misspecified (namely when $R(f^*) > R(f^{(\mathrm{reg})})$). Moreover, we do not assume either that $Y - f^{(\mathrm{reg})}(X)$ and $X$ are independent.

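Notation. When $\Theta = \mathbb{R}^d$, the function $f^*$ and the space $\mathcal{F}$ will be written $f^*_{\mathrm{lin}}$ and $\mathcal{F}_{\mathrm{lin}}$ to emphasize that $\mathcal{F}$ is the whole linear space spanned by $\varphi_1, \dots, \varphi_d$:

$$\mathcal{F}_{\mathrm{lin}} = \operatorname{span}\{\varphi_1, \dots, \varphi_d\} \quad \text{and} \quad f^*_{\mathrm{lin}} \in \operatorname*{argmin}_{f \in \mathcal{F}_{\mathrm{lin}}} R(f).$$

The Euclidean norm will simply be written as $\|\cdot\|$, and $\langle \cdot, \cdot \rangle$ will be its associated inner product. We will consider the vector valued function $\varphi : \mathcal{X} \to \mathbb{R}^d$ defined by $\varphi(X) = [\varphi_j(X)]_{j=1}^d$, so that for any $\theta \in \Theta$, we have

$$f_\theta(X) = \langle \theta, \varphi(X) \rangle.$$

The Gram matrix is the $d \times d$-matrix $Q = \mathbb{E}[\varphi(X) \varphi(X)^T]$, and its smallest and largest eigenvalues will respectively be written as $q_{\min}$ and $q_{\max}$. The empirical risk of a function $f$ is

$$r(f) = \frac{1}{n} \sum_{i=1}^n [f(X_i) - Y_i]^2$$

and for $\lambda \ge 0$, the ridge regression estimator on $\mathcal{F}$ is defined by $\hat f^{(\mathrm{ridge})} = f_{\hat\theta^{(\mathrm{ridge})}}$ with

$$\hat\theta^{(\mathrm{ridge})} \in \operatorname*{argmin}_{\theta \in \Theta} \; r(f_\theta) + \lambda \|\theta\|^2,$$

where $\lambda$ is some nonnegative real parameter. In the case when $\lambda = 0$, the ridge regression $\hat f^{(\mathrm{ridge})}$ is nothing but the empirical risk minimizer $\hat f^{(\mathrm{erm})}$. In the same way, we introduce the optimal ridge function optimizing the expected ridge risk: $\tilde f = f_{\tilde\theta}$ with

$$\tilde\theta \in \operatorname*{argmin}_{\theta \in \Theta} \big\{ R(f_\theta) + \lambda \|\theta\|^2 \big\}. \tag{0.5}$$

Finally, let $Q_\lambda = Q + \lambda I$ be the ridge regularization of $Q$, where $I$ is the identity matrix.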

Outline and contributions. The paper is organized as follows. Section 1 
is a survey on risk bounds in linear least squares regression. Theorems 1.3 and 
1.5 are the results which come closer to our target. Section 2 presents our main 
result on linear least squares regression. Section 3 gives risk bounds for general 
loss functions from which the results of Section 2 are derived. Appendix A shows 
that (0.2) cannot hold under the only assumption that the variance of $Y$ is finite, even in the favorable situation where $f^{(\mathrm{reg})}$ belongs to $\mathcal{F}$.

The main contribution of this paper is to show that an appropriate shrinkage estimator involving truncated differences of losses has an excess risk of order $d/n$ (without a logarithmic factor as it appears in numerous works), concentrating exponentially, which does not degrade when the matrix $Q$ is ill-conditioned or when some ratio of $L^2$ and $L^\infty$ norms behaves badly or when the output distribution is heavy-tailed. Our results tend to say that shrinkage and truncation lead to more robust algorithms when we consider robustness with respect to the distribution of the noise, and not to a potential contamination of the training data by input-output pairs not generated by $P$.



1. Variants of known results 

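1.1. Ordinary least squares and empirical risk minimization. The ordinary least squares estimator is the most standard method in linear least squares regression. It minimizes the empirical risk

$$r(f) = \frac{1}{n} \sum_{i=1}^n [f(X_i) - Y_i]^2$$

among functions in $\mathcal{F}_{\mathrm{lin}}$ and produces

$$\hat f^{(\mathrm{ols})} = \sum_{j=1}^d \hat\theta_j^{(\mathrm{ols})} \varphi_j,$$

with $\hat\theta^{(\mathrm{ols})} = [\hat\theta_j^{(\mathrm{ols})}]_{j=1}^d$ a column vector satisfying

$$X^T X \, \hat\theta^{(\mathrm{ols})} = X^T Y, \tag{1.1}$$

where $Y = [Y_j]_{j=1}^n$ and $X = (\varphi_j(X_i))_{1 \le i \le n,\, 1 \le j \le d}$. It is well-known that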

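• the linear system (1.1) has at least one solution, and in fact, the set of solutions is exactly $\{X^+ Y + u \,;\, u \in \ker X\}$, where $X^+$ is the Moore-Penrose pseudoinverse of $X$ and $\ker X$ is the kernel of the linear operator $X$;

• $X \hat\theta^{(\mathrm{ols})}$ is the (unique) orthogonal projection of the vector $Y \in \mathbb{R}^n$ on the image of the linear map $X$;

• if $\sup_{x \in \mathcal{X}} \operatorname{Var}(Y \mid X = x) = \sigma^2 < +\infty$, we have (see [16, Theorem 11.1]) for any $X_1, \dots, X_n$ in $\mathcal{X}$,

$$\mathbb{E}\Big\{ \frac{1}{n} \sum_{i=1}^n \big[ \hat f^{(\mathrm{ols})}(X_i) - f^{(\mathrm{reg})}(X_i) \big]^2 \,\Big|\, X_1, \dots, X_n \Big\} - \min_{f \in \mathcal{F}_{\mathrm{lin}}} \frac{1}{n} \sum_{i=1}^n \big[ f(X_i) - f^{(\mathrm{reg})}(X_i) \big]^2 \le \sigma^2 \, \frac{\operatorname{rank}(X)}{n} \le \sigma^2 \, \frac{d}{n}, \tag{1.2}$$

where we recall that $f^{(\mathrm{reg})} : x \mapsto \mathbb{E}[Y \mid X = x]$ is the optimal regression function, and that when this function belongs to $\mathcal{F}_{\mathrm{lin}}$ (i.e., $f^{(\mathrm{reg})} = f^*_{\mathrm{lin}}$), the minimum term in (1.2) vanishes;

• from Pythagoras' theorem for the (semi)norm $W \mapsto \sqrt{\mathbb{E} W^2}$ on the space of the square integrable random variables,

$$R(\hat f^{(\mathrm{ols})}) - R(f^*_{\mathrm{lin}}) = \mathbb{E}\big[ (\hat f^{(\mathrm{ols})}(X) - f^{(\mathrm{reg})}(X))^2 \,\big|\, Z_1, \dots, Z_n \big] - \mathbb{E}\big[ (f^*_{\mathrm{lin}}(X) - f^{(\mathrm{reg})}(X))^2 \big]. \tag{1.3}$$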

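The analysis of the ordinary least squares often stops at this point in classical statistical textbooks. (Besides, to simplify, the strong assumption $f^{(\mathrm{reg})} = f^*_{\mathrm{lin}}$ is often made.) This can be misleading since Inequality (1.2) does not imply a $d/n$ upper bound on the risk of $\hat f^{(\mathrm{ols})}$. Nevertheless the following result holds [16, Theorem 11.3].

Theorem 1.1 If $\sup_{x \in \mathcal{X}} \operatorname{Var}(Y \mid X = x) = \sigma^2 < +\infty$ and

$$\|f^{(\mathrm{reg})}\|_\infty = \sup_{x \in \mathcal{X}} |f^{(\mathrm{reg})}(x)| \le H$$

for some $H > 0$, then the truncated estimator $\hat f_H^{(\mathrm{ols})} = (\hat f^{(\mathrm{ols})} \wedge H) \vee -H$ satisfies

$$\mathbb{E} R(\hat f_H^{(\mathrm{ols})}) - R(f^{(\mathrm{reg})}) \le 8 \big[ R(f^*_{\mathrm{lin}}) - R(f^{(\mathrm{reg})}) \big] + \kappa \, \frac{(\sigma^2 \vee H^2) \, d \log n}{n} \tag{1.4}$$

for some numerical constant $\kappa$.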
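The truncation $(\hat f \wedge H) \vee -H$ used in Theorem 1.1 is simply a clipping of the prediction to the interval $[-H, H]$. A minimal sketch, in which the helper names `truncate` and `truncated_predictor` are hypothetical and chosen only for illustration:

```python
def truncate(value, H):
    """Clip a prediction to [-H, H], i.e. compute (value ∧ H) ∨ (-H)."""
    if H <= 0:
        raise ValueError("H must be positive")
    return max(min(value, H), -H)


def truncated_predictor(predict, H):
    """Wrap any predictor x -> float so that its output is clipped to [-H, H]."""
    return lambda x: truncate(predict(x), H)


# Usage: an unbounded linear predictor becomes bounded by H = 1.
clipped = truncated_predictor(lambda x: 10.0 * x, 1.0)
```

Clipping never moves a prediction away from a target known to lie in $[-H, H]$, which is why the bound on $\|f^{(\mathrm{reg})}\|_\infty$ is what makes the truncation harmless.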



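Using PAC-Bayesian inequalities, Catoni [10, Proposition 5.9.1] has proved a different type of results on the generalization ability of $\hat f^{(\mathrm{ols})}$.

Theorem 1.2 Let $\mathcal{F}' \subset \mathcal{F}_{\mathrm{lin}}$ be such that for some positive constants $a$, $M$, $M'$:

• there exists $f_0 \in \mathcal{F}'$ s.t. for any $x \in \mathcal{X}$,

$$\mathbb{E}\big\{ \exp\big[ a |Y - f_0(X)| \big] \,\big|\, X = x \big\} \le M;$$

• for any $f_1, f_2 \in \mathcal{F}'$, $\sup_{x \in \mathcal{X}} |f_1(x) - f_2(x)| \le M'$.

Let $Q = \mathbb{E}[\varphi(X) \varphi(X)^T]$ and $\hat Q = \frac{1}{n} \sum_{i=1}^n \varphi(X_i) \varphi(X_i)^T$ be respectively the expected and empirical Gram matrices. If $\det Q \ne 0$, then there exist positive constants $C_1$ and $C_2$ (depending only on $a$, $M$ and $M'$) such that with probability at least $1 - \varepsilon$, as soon as

$$\Big\{ f \in \mathcal{F}_{\mathrm{lin}} : r(f) \le r(\hat f^{(\mathrm{ols})}) + C_1 \frac{d}{n} \Big\} \subset \mathcal{F}', \tag{1.5}$$

we have

$$R(\hat f^{(\mathrm{ols})}) - R(f^*_{\mathrm{lin}}) \le C_2 \, \frac{d + \log(\varepsilon^{-1}) + \log\big( \frac{\det \hat Q}{\det Q} \big)}{n}.$$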

This result can be understood as follows. Let us assume we have some prior knowledge suggesting that $f^*_{\mathrm{lin}}$ belongs to the interior of a set $\mathcal{F}' \subset \mathcal{F}_{\mathrm{lin}}$ (e.g., a bound on the coefficients of the expansion of $f^*_{\mathrm{lin}}$ as a linear combination of $\varphi_1, \dots, \varphi_d$). It is likely that (1.5) holds, and it is indeed proved in Catoni [10, section 5.11] that the probability that it does not hold goes to zero exponentially fast with $n$ in the case when $\mathcal{F}'$ is a Euclidean ball. If it is the case, then we know that the excess risk is of order $d/n$ up to the unpleasant ratio of determinants, which, fortunately, almost surely tends to 1 as $n$ goes to infinity.

By using localized PAC-Bayes inequalities introduced in Catoni [9, 11], one can derive from Inequality (6.9) and Lemma 4.1 of Alquier [1] the following result.

Theorem 1.3 Let $q_{\min}$ be the smallest eigenvalue of the Gram matrix $Q = \mathbb{E}[\varphi(X) \varphi(X)^T]$. Assume that there exist a function $f_0 \in \mathcal{F}_{\mathrm{lin}}$ and positive constants $H$ and $C$ such that

$$\|f^*_{\mathrm{lin}} - f_0\|_\infty \le H$$

and $|Y| \le C$ almost surely.

Then for an appropriate randomized estimator requiring the knowledge of $f_0$, $H$ and $C$, for any $\varepsilon > 0$ with probability at least $1 - \varepsilon$ w.r.t. the distribution generating the observations $Z_1, \dots, Z_n$ and the randomized prediction function $\hat f$, we have

$$R(\hat f) - R(f^*_{\mathrm{lin}}) \le \kappa (H^2 + C^2) \, \frac{d \log(3 q_{\min}^{-1}) + \log\big( (\log n) \, \varepsilon^{-1} \big)}{n}, \tag{1.6}$$

for some $\kappa$ not depending on $d$ and $n$.

Using the result of [10, Section 5.11], one can prove that Alquier's result still holds for $\hat f = \hat f^{(\mathrm{ols})}$, but with $\kappa$ also depending on the determinant of the product matrix $Q$. The $\log[\log(n)]$ factor is unimportant and could be removed in the special case quoted here (it comes from a union bound on a grid of possible temperature parameters, whereas the temperature could be set here to a fixed value). The result differs from Theorem 1.2 essentially by the fact that the ratio of the determinants of the empirical and expected product matrices has been replaced by the inverse of the smallest eigenvalue of the quadratic form $\theta \mapsto R\big( \sum_{j=1}^d \theta_j \varphi_j \big) - R(f^*_{\mathrm{lin}})$. In the case when the expected Gram matrix is known (e.g., in the case of a fixed design, and also in the slightly different context of transductive inference), this smallest eigenvalue can be set to one by choosing the quadratic form $\theta \mapsto R(f_\theta) - R(f^*_{\mathrm{lin}})$ to define the Euclidean metric on the parameter space.

Localized Rademacher complexities [19, 6] allow one to prove the following property of the empirical risk minimizer.

Theorem 1.4 Assume that the input representation $\varphi(X)$, the set of parameters and the output $Y$ are almost surely bounded, i.e., for some positive constants $H$ and $C$,

$$\sup_{\theta \in \Theta} \|\theta\| \le 1, \qquad \operatorname*{ess\,sup} \|\varphi(X)\| \le H, \qquad \text{and} \qquad |Y| \le C \ \text{a.s.}$$

Let $\nu_1 \ge \dots \ge \nu_d$ be the eigenvalues of the Gram matrix $Q = \mathbb{E}[\varphi(X) \varphi(X)^T]$. The empirical risk minimizer satisfies for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$:

$$R(\hat f^{(\mathrm{erm})}) - R(f^*) \le \kappa (H + C)^2 \, \frac{\min_{0 \le h \le d} \Big( h + \sqrt{\frac{n}{(H + C)^2} \sum_{i > h} \nu_i} \Big) + \log(\varepsilon^{-1})}{n} \le \kappa (H + C)^2 \, \frac{\operatorname{rank}(Q) + \log(\varepsilon^{-1})}{n},$$

where $\kappa$ is a numerical constant.

Proof. The result is a modified version of Theorem 6.7 in [6] applied to the linear kernel $k(u, v) = \langle u, v \rangle / (H + C)^2$. Its proof follows the same lines as in Theorem 6.7 mutatis mutandis: Corollary 5.3 and Lemma 6.5 should be used as intermediate steps instead of Theorem 5.4 and Lemma 6.6, the nonzero eigenvalues of the integral operator induced by the kernel being the nonzero eigenvalues of $Q$. □

When we know that the target function $f^*_{\mathrm{lin}}$ is inside some ball, it is natural to consider the empirical risk minimizer on this ball. This makes it possible to compare Theorem 1.4 to excess risk bounds with respect to $f^*_{\mathrm{lin}}$.

Finally, from the work of Birgé and Massart [7], we may derive the following risk bound for the empirical risk minimizer on a ball (see Appendix B).

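Theorem 1.5 Assume that $\mathcal{F}$ has a diameter upper bounded by $H$ for the $L^\infty$-norm, i.e., for any $f_1, f_2$ in $\mathcal{F}$, $\sup_{x \in \mathcal{X}} |f_1(x) - f_2(x)| \le H$ and there exists a function $f_0 \in \mathcal{F}$ satisfying the exponential moment condition:

$$\text{for any } x \in \mathcal{X}, \quad \mathbb{E}\big\{ \exp\big[ A^{-1} |Y - f_0(X)| \big] \,\big|\, X = x \big\} \le M, \tag{1.7}$$

for some positive constants $A$ and $M$. Let

$$B = \inf_{\phi_1, \dots, \phi_d} \; \sup_{\theta \in \mathbb{R}^d \setminus \{0\}} \frac{\big\| \sum_{j=1}^d \theta_j \phi_j \big\|_\infty^2}{\|\theta\|_\infty^2},$$

where the infimum is taken with respect to all possible orthonormal bases of $\mathcal{F}$ for the dot product $\langle f_1, f_2 \rangle = \mathbb{E} f_1(X) f_2(X)$ (when the set $\mathcal{F}$ admits no basis with exactly $d$ functions, we set $B = +\infty$). Then the empirical risk minimizer satisfies for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$:

$$R(\hat f^{(\mathrm{erm})}) - R(f^*) \le \kappa (A^2 + H^2) \, \frac{d \log\big[ 2 + (B/n) \wedge (n/d) \big] + \log(\varepsilon^{-1})}{n},$$

where $\kappa$ is a positive constant depending only on $M$.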

This result comes closer to what we are looking for: it gives exponential deviation inequalities of order at worst $d \log(n/d)/n$. It shows that, even if the Gram matrix $Q$ has a very small eigenvalue, there is an algorithm satisfying a convergence rate of order $d \log(n/d)/n$. In this respect, this result is stronger than Theorem 1.3. However there are cases in which the smallest eigenvalue of $Q$ is of order 1, while $B$ is large (i.e., $B \gg n$). In these cases, Theorem 1.3 does not contain the logarithmic factor which appears in Theorem 1.5.



1.2. Projection estimator. When the input distribution is known, an alternative to the ordinary least squares estimator is the following projection estimator. One first finds an orthonormal basis of $\mathcal{F}_{\mathrm{lin}}$ for the dot product $\langle f_1, f_2 \rangle = \mathbb{E} f_1(X) f_2(X)$, and then uses the projection estimator on this basis. Specifically, if $\phi_1, \dots, \phi_d$ form an orthonormal basis of $\mathcal{F}_{\mathrm{lin}}$, then the projection estimator on this basis is:

$$\hat f^{(\mathrm{proj})} = \sum_{j=1}^d \hat\theta_j^{(\mathrm{proj})} \phi_j,$$

with

$$\hat\theta_j^{(\mathrm{proj})} = \frac{1}{n} \sum_{i=1}^n Y_i \phi_j(X_i).$$

The following excess risk bound of order $d/n$ for this estimator is Theorem 4 in [25] up to minor changes in the assumptions.

Theorem 1.6 If $\sup_{x \in \mathcal{X}} \operatorname{Var}(Y \mid X = x) = \sigma^2 < +\infty$ and

$$\|f^{(\mathrm{reg})}\|_\infty = \sup_{x \in \mathcal{X}} |f^{(\mathrm{reg})}(x)| \le H < +\infty,$$

then we have

$$\mathbb{E} R(\hat f^{(\mathrm{proj})}) - R(f^*_{\mathrm{lin}}) \le (\sigma^2 + H^2) \, \frac{d}{n}. \tag{1.8}$$
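As a rough illustration of the projection estimator, the sketch below assumes a setting that is not in the paper: $X$ uniform on $[0, 1]$ (so the input distribution is known), the basis $\phi_1(x) = 1$, $\phi_2(x) = \sqrt{3}(2x - 1)$, which is orthonormal for $\langle f_1, f_2 \rangle = \mathbb{E} f_1(X) f_2(X)$ under that distribution, and Gaussian noise. The function names are hypothetical.

```python
import math
import random


# Orthonormal basis of span{1, x} in L2 of the uniform law on [0, 1]
# (an assumed example; the paper does not fix a distribution).
def phi1(x):
    return 1.0


def phi2(x):
    return math.sqrt(3.0) * (2.0 * x - 1.0)


def projection_coefficients(xs, ys, basis):
    """theta_j = (1/n) * sum_i Y_i * phi_j(X_i) for each basis function."""
    n = len(xs)
    return [sum(y * phi(x) for x, y in zip(xs, ys)) / n for phi in basis]


def simulate(n=20000, seed=0):
    """Draw data from Y = 2*phi1(X) + 0.5*phi2(X) + noise and estimate."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [2.0 * phi1(x) + 0.5 * phi2(x) + rng.gauss(0.0, 0.5) for x in xs]
    return projection_coefficients(xs, ys, [phi1, phi2])
```

No linear system is solved here: orthonormality under the known input distribution replaces the inversion of the Gram matrix, which is why the estimator is unavailable when that distribution is unknown.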

1.3. Penalized least squares estimator. It is well established that parameters of the ordinary least squares estimator are numerically unstable, and that the phenomenon can be corrected by adding an $L^2$ penalty ([20, 23]). This solution has been labeled ridge regression in statistics ([17]), and consists in replacing $\hat f^{(\mathrm{ols})}$ by $\hat f^{(\mathrm{ridge})} = f_{\hat\theta^{(\mathrm{ridge})}}$ with

$$\hat\theta^{(\mathrm{ridge})} \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ r(f_\theta) + \lambda \sum_{j=1}^d \theta_j^2 \Big\},$$

where $\lambda$ is a positive parameter. The typical value of $\lambda$ should be small to avoid excessive shrinkage of the coefficients, but not too small in order to make the optimization task numerically more stable.
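To make the stabilizing effect concrete, here is a minimal sketch of ridge regression in the unconstrained case $\Theta = \mathbb{R}^d$. Setting the gradient of $r(f_\theta) + \lambda \|\theta\|^2$ to zero gives the linear system $(X^T X / n + \lambda I)\,\theta = X^T Y / n$, which the code solves by Gaussian elimination. The function names and the plain-Python solver are illustrative choices, not part of the paper.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ridge_coefficients(features, ys, lam):
    """Minimize (1/n) * sum_i (y_i - <theta, features[i]>)^2 + lam * ||theta||^2.

    features[i] holds [phi_1(X_i), ..., phi_d(X_i)].
    """
    n, d = len(features), len(features[0])
    gram = [[sum(f[j] * f[k] for f in features) / n + (lam if j == k else 0.0)
             for k in range(d)] for j in range(d)]
    rhs = [sum(f[j] * y for f, y in zip(features, ys)) / n for j in range(d)]
    return solve_linear_system(gram, rhs)
```

With collinear features, for example `[[1.0, 1.0], [2.0, 2.0]]`, the system is singular for `lam = 0` and the solver raises an error, whereas any `lam > 0` yields a unique solution.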

Risk bounds for this estimator can be derived from general results concerning penalized least squares on reproducing kernel Hilbert spaces ([8]), but as shown in Appendix C, this ends up with complicated results having the desired $d/n$ rate only under strong assumptions.

Another popular regularizer is the $L^1$ norm. This procedure is known as Lasso [24] and is defined by

$$\hat\theta^{(\mathrm{lasso})} \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ r(f_\theta) + \lambda \sum_{j=1}^d |\theta_j| \Big\}.$$

Like the $L^2$ penalty, the $L^1$ penalty shrinks the coefficients. The difference is that for coefficients which tend to be close to zero, the shrinkage makes them equal to zero. This makes it possible to select relevant variables (i.e., find the $j$'s such that $\theta_j^* \ne 0$). If we assume that the regression function $f^{(\mathrm{reg})}$ is a linear combination of only $d^* \ll d$ variables/functions $\varphi_j$'s, the typical result is to prove that the risk of the Lasso estimator for $\lambda$ of order $\sqrt{(\log d)/n}$ is of order $(d^* \log d)/n$. Since this quantity is much smaller than $d/n$, this makes a huge improvement (provided that the sparsity assumption is true). This kind of result usually requires strong conditions on the eigenvalues of submatrices of $Q$, essentially assuming that the functions $\varphi_j$ are near orthogonal. We do not know to which extent these conditions are required. However, if we do not consider the specific algorithm of Lasso, but the model selection approach developed in [1], one can change these conditions into a single condition concerning only the minimal eigenvalue of the submatrix of $Q$ corresponding to relevant variables. In fact, we will see that even this condition can be removed.
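The contrast between the two penalties is easiest to see in the special case of an orthonormal empirical design, $X^T X / n = I$, which is an assumption made here only for illustration. In that case the objective separates across coordinates, and with $b = X^T Y / n$ the ordinary least squares coefficients, the $L^1$-penalized solution is the soft-thresholding $\operatorname{sign}(b_j) \max(|b_j| - \lambda/2, 0)$ while the $L^2$-penalized solution is $b_j / (1 + \lambda)$. The function names below are hypothetical.

```python
def soft_threshold(b, lam):
    """Minimizer of theta^2 - 2*b*theta + lam*|theta| over real theta."""
    t = lam / 2.0
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0


def lasso_orthonormal(ols_coeffs, lam):
    """L1-penalized coefficients when the empirical design is orthonormal."""
    return [soft_threshold(b, lam) for b in ols_coeffs]


def ridge_orthonormal(ols_coeffs, lam):
    """L2-penalized coefficients when the empirical design is orthonormal."""
    return [b / (1.0 + lam) for b in ols_coeffs]
```

Small coefficients are set exactly to zero by the first rule and only scaled down by the second, which is the variable selection effect described above.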

1.4. Conclusion of the survey. Previous results clearly leave room for improvement. The projection estimator requires the unrealistic assumption that the input distribution is known, and the result holds only in expectation. Results using $L^1$ or $L^2$ regularizations require strong assumptions, in particular on the eigenvalues of (submatrices of) $Q$. Theorem 1.1 provides a $(d \log n)/n$ convergence rate only when $R(f^*_{\mathrm{lin}}) - R(f^{(\mathrm{reg})})$ is at most of order $(d \log n)/n$. Theorem 1.2 gives a different type of guarantee: the $d/n$ rate is indeed achieved, but the random ratio of determinants appearing in the bound may raise some eyebrows and forbid an explicit computation of the bound and comparison with other bounds. Theorem 1.3 seems to indicate that the rate of convergence will be degraded when the Gram matrix $Q$ is unknown and ill-conditioned. Theorem 1.4 does not put any assumption on $Q$ to reach the $d/n$ rate, but requires particular boundedness constraints on the output. Finally, Theorem 1.5 comes closer to what we are looking for. Yet there is still an unwanted logarithmic factor, and the result holds only when the output has uniformly bounded conditional exponential moments, which as we will show is not necessary.

Our recent work [4] provides a risk bound for ridge regression showing the benefit on the effective dimension of the shrinkage parameter $\lambda$ and being of order $d/n$ (without logarithmic factor). The work [4] also proposes a robust estimator for linear least squares, which satisfies a $d/n$ excess risk bound without logarithmic factor, but with constants involving several kurtosis coefficients. As
discussed in Section 3.2 of [4], depending on the basis functions and the distribution $P$, these kurtosis coefficients typically behave either as numerical constants or $\sqrt{d}$ (but worse non-asymptotic behaviors of these constants can also occur).
Finally, several works, and in particular those cited in Section 1.1, have considered the problem of model selection where several linear spaces are simultaneously considered, and the goal is to predict as well as the best function in the union of the linear spaces. Only a few of them considered the case of outputs having only finite conditional moments (and not finite conditional exponential moments). This is the case of [5] in the fixed design setting and [26] in the random design setting. The excess risk bounds there are typically of order $d/n$ with $d$ the dimension of the "best" linear space, but hold in expectation and essentially when the optimal regression function $f^{(\mathrm{reg})}$ belongs to the union of linear spaces.



2. A SIMPLE TIGHT RISK BOUND FOR A SOPHISTICATED PAC-BAYES ALGORITHM

In this section, we provide a sophisticated estimator, having a simple theoretical excess risk bound, with neither a logarithmic factor, nor complex constants involving the conditioning of $Q$, kurtosis coefficients or some geometric quantity characterizing the relation between $L^2$ and $L^\infty$-balls.

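We consider that the set $\Theta$ is bounded so that we can define the "prior" distribution $\pi$ as the uniform distribution on $\mathcal{F}$ (i.e., the one induced by the Lebesgue distribution on $\Theta \subset \mathbb{R}^d$ renormalized to get $\pi(\mathcal{F}) = 1$). Let $\lambda > 0$ and

$$W_i(f, f') = \lambda \big\{ [Y_i - f(X_i)]^2 - [Y_i - f'(X_i)]^2 \big\}.$$

Introduce

$$\hat{\mathcal{E}}(f) = \log \int \frac{\pi(df')}{\prod_{i=1}^n \big[ 1 - W_i(f, f') + \frac{1}{2} W_i(f, f')^2 \big]}. \tag{2.1}$$

We consider the "posterior" distribution $\hat\pi$ on the set $\mathcal{F}$ with density:

$$\frac{d\hat\pi}{d\pi}(f) = \frac{\exp[-\hat{\mathcal{E}}(f)]}{\int \exp[-\hat{\mathcal{E}}(f')] \, \pi(df')}. \tag{2.2}$$

To understand intuitively why this distribution concentrates on functions with low risk, one should think that when $\lambda$ is small enough, $1 - W_i(f, f') + \frac{1}{2} W_i(f, f')^2$ is close to $e^{-W_i(f, f')}$ and consequently

$$\hat{\mathcal{E}}(f) \approx \lambda \sum_{i=1}^n [Y_i - f(X_i)]^2 + \log \int \pi(df') \exp\Big\{ -\lambda \sum_{i=1}^n [Y_i - f'(X_i)]^2 \Big\},$$

and

$$\frac{d\hat\pi}{d\pi}(f) \approx \frac{\exp\big\{ -\lambda \sum_{i=1}^n [Y_i - f(X_i)]^2 \big\}}{\int \exp\big\{ -\lambda \sum_{i=1}^n [Y_i - f'(X_i)]^2 \big\} \, \pi(df')}.$$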
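The construction above can be mimicked numerically in a toy case. The sketch below is only an illustration under strong simplifying assumptions that are not in the paper: dimension $d = 1$ with $\varphi(x) = x$, $\Theta = [-1, 1]$, and the uniform prior replaced by a uniform distribution on a finite grid of parameters. It computes a grid version of the energy $\log \int \pi(df') / \prod_i [1 - W_i + W_i^2/2]$ and of the posterior weights proportional to $\exp[-\hat{\mathcal{E}}(f)]$, then draws one parameter from them. All function names are hypothetical.

```python
import math
import random


def soft_truncated_energy(theta, grid, xs, ys, lam):
    """Grid version of the energy: log of the prior average over theta' of
    prod_i 1 / (1 - W_i + W_i^2 / 2), with a uniform prior on `grid`."""
    log_products = []
    for theta_prime in grid:
        total = 0.0
        for x, y in zip(xs, ys):
            w = lam * ((y - theta * x) ** 2 - (y - theta_prime * x) ** 2)
            # 1 - w + w^2/2 = ((1 - w)^2 + 1) / 2 >= 1/2, so the log is defined.
            total -= math.log(1.0 - w + 0.5 * w * w)
        log_products.append(total)
    top = max(log_products)  # log-sum-exp trick for numerical stability
    mean_exp = sum(math.exp(v - top) for v in log_products) / len(grid)
    return top + math.log(mean_exp)


def posterior_weights(grid, xs, ys, lam):
    """Grid version of the posterior: weights proportional to exp(-energy)."""
    energies = [soft_truncated_energy(t, grid, xs, ys, lam) for t in grid]
    low = min(energies)
    unnormalized = [math.exp(low - e) for e in energies]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]


def demo(n=200, seed=0):
    """Toy data Y = 0.5 * X + noise, X uniform on [-1, 1]."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [0.5 * x + rng.gauss(0.0, 0.3) for x in xs]
    grid = [-1.0 + 0.05 * k for k in range(41)]        # discretized Theta
    sigma, H = 0.3, 2.0                                # assumed known here
    lam = 0.32 / (2.0 * sigma + H) ** 2
    weights = posterior_weights(grid, xs, ys, lam)
    draw = rng.choices(grid, weights=weights, k=1)[0]  # randomized estimator
    return grid, weights, draw
```

The cost of this naive version grows like the square of the grid size times $n$, and the grid size itself grows exponentially with $d$, so the sketch says nothing about how to compute the estimator in realistic dimensions.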
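The following theorem gives a $d/n$ convergence rate for the randomized algorithm which draws the prediction function from $\mathcal{F}$ according to the distribution $\hat\pi$.

Theorem 2.1 Assume that $\mathcal{F}$ has a diameter upper bounded by $H$ for the $L^\infty$-norm:

$$\sup_{f_1, f_2 \in \mathcal{F},\, x \in \mathcal{X}} |f_1(x) - f_2(x)| \le H \tag{2.3}$$

and that, for some $\sigma > 0$,

$$\sup_{x \in \mathcal{X}} \mathbb{E}\big\{ [Y - f^*(X)]^2 \,\big|\, X = x \big\} \le \sigma^2 < +\infty. \tag{2.4}$$

Let $\hat f$ be a prediction function drawn from the distribution $\hat\pi$ defined in (2.2) and depending on the parameter $\lambda > 0$. Then for any $0 < \eta' < 1 - \lambda (2\sigma + H)^2$ and $\varepsilon > 0$, with probability (with respect to the distribution $P^{\otimes n} \hat\pi$ generating the observations $Z_1, \dots, Z_n$ and the randomized prediction function $\hat f$) at least $1 - \varepsilon$, we have

$$R(\hat f) - R(f^*) \le (2\sigma + H)^2 \, \frac{C_1 d + C_2 \log(2 \varepsilon^{-1})}{n}$$

with

$$C_1 = \frac{\log\Big( \frac{(1 + \eta)^2}{\eta' (1 - \eta)} \Big)}{\eta (1 - \eta - \eta')} \quad \text{and} \quad C_2 = \frac{2}{\eta (1 - \eta - \eta')} \quad \text{and} \quad \eta = \lambda (2\sigma + H)^2.$$

In particular for $\lambda = 0.32 \, (2\sigma + H)^{-2}$ and $\eta' = 0.18$, we get

$$R(\hat f) - R(f^*) \le (2\sigma + H)^2 \, \frac{16.6 \, d + 12.5 \log(2 \varepsilon^{-1})}{n}.$$

Besides if $f^* \in \operatorname*{argmin}_{f \in \mathcal{F}_{\mathrm{lin}}} R(f)$, then with probability at least $1 - \varepsilon$, we have

$$R(\hat f) - R(f^*) \le (2\sigma + H)^2 \, \frac{8.3 \, d + 12.5 \log(2 \varepsilon^{-1})}{n}.$$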

Proof. This is a direct consequence of Theorem 3.5 (page 21), Lemma 3.3 
(page 19) and Lemma 3.6 (page 23). □ 
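The numerical constants of Theorem 2.1 can be checked with a few lines of code. This sketch assumes that the constants read $C_1 = \log\big((1+\eta)^2 / (\eta'(1-\eta))\big) / (\eta(1-\eta-\eta'))$ and $C_2 = 2 / (\eta(1-\eta-\eta'))$ with $\eta = \lambda(2\sigma+H)^2$; this reading is an assumption, retained because it reproduces the values 16.6 and 12.5 quoted in the theorem (and 8.3 when $C_1$ is halved). The function names are hypothetical.

```python
import math


def theorem_2_1_constants(eta, eta_prime):
    """C1 and C2 under the assumed reading of the constants of Theorem 2.1."""
    if not (eta > 0.0 and 0.0 < eta_prime < 1.0 - eta):
        raise ValueError("need eta > 0 and 0 < eta_prime < 1 - eta")
    denom = eta * (1.0 - eta - eta_prime)
    c1 = math.log((1.0 + eta) ** 2 / (eta_prime * (1.0 - eta))) / denom
    c2 = 2.0 / denom
    return c1, c2


def excess_risk_bound(d, n, eps, sigma, H, eta=0.32, eta_prime=0.18):
    """Right-hand side (2*sigma + H)^2 * (C1*d + C2*log(2/eps)) / n."""
    c1, c2 = theorem_2_1_constants(eta, eta_prime)
    return (2.0 * sigma + H) ** 2 * (c1 * d + c2 * math.log(2.0 / eps)) / n
```

For instance, `theorem_2_1_constants(0.32, 0.18)` corresponds to the choice $\lambda = 0.32 (2\sigma + H)^{-2}$ and $\eta' = 0.18$ made in the theorem.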

If we know that $f^*_{\mathrm{lin}}$ belongs to some bounded ball in $\mathcal{F}_{\mathrm{lin}}$, then one can define a bounded $\mathcal{F}$ as this ball, use the previous theorem and obtain an excess risk bound with respect to $f^*_{\mathrm{lin}}$.

Remark 2.1 Let us discuss this result. On the positive side, we have a $d/n$ convergence rate in expectation and in deviations. It has no extra logarithmic factor. It does not require any particular assumption on the smallest eigenvalue of the covariance matrix. To achieve exponential deviations, a uniformly bounded second moment of the output knowing the input is surprisingly sufficient: we do not require the traditional exponential moment condition on the output. Appendix A (page 34) argues that the uniformly bounded conditional second moment assumption cannot be replaced with just a bounded second moment condition.

On the negative side, the estimator is rather complicated. With today's computers and numerical methods, it seems impossible to get a good approximation of it even when the dimension $d$ is small. Nevertheless, in presence of a heavy-tailed noise distribution, it can be a way to move from the empirical risk minimizer (which is the baseline estimator for linear regression) in the right direction (that is, in a direction in which one can find an estimator having a smaller risk than that of the empirical risk minimizer). When the target is to predict as well as the best linear combination $f^*_{\mathrm{lin}}$ up to a small additive term, the estimator requires the knowledge of an $L^\infty$-bounded ball in which $f^*_{\mathrm{lin}}$ lies and an upper bound on $\sup_{x \in \mathcal{X}} \mathbb{E}\{ [Y - f^*_{\mathrm{lin}}(X)]^2 \mid X = x \}$. The looser this knowledge is, the bigger the constant in front of $d/n$ is. Note that the possible lack of knowledge of $H$ and $\sigma$ calls for a model selection algorithm, which goes beyond the scope of this work. In practice, a careful application of (cross-)validation ideas would probably be sufficient to select these parameters.

Remark 2.2 The proposed randomized estimator is more complex than the classical Gibbs estimator (that is the one with exponential weights involving the empirical risk). Even if the paper does not prove it, we believe that the classical Gibbs estimator cannot be robust to heavy-tailed noise. This belief is motivated by the same arguments as the ones used in [12] to show the absence of robustness of the empirical mean estimator. In absence of heavy-tailed noise, the classical Gibbs estimator satisfies a similar result to Theorem 2.1, given in Theorem 3.2.

Our randomized algorithm consists in drawing the prediction function according to $\hat\pi$. As usual, by convexity of the loss function, the risk of the deterministic estimator $\hat f_{\mathrm{determ}} = \int f \, \hat\pi(df)$ satisfies $R(\hat f_{\mathrm{determ}}) \le \int R(f) \, \hat\pi(df)$, so that, after some computations, one can prove that for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$:

$$R(\hat f_{\mathrm{determ}}) - R(f^*) \le \kappa (2\sigma + H)^2 \, \frac{d + \log(\varepsilon^{-1})}{n},$$

for some appropriate numerical constant $\kappa > 0$.

Remark 2.3 We consider a "prior" distribution $\pi$, which is a uniform distribution on $\mathcal{F}$. In presence of sparsity (when only a small number of the coefficients $\theta^*$ in (0.1) are nonzero), alternative prior distributions (of Laplace form) are useful in fixed design regression [13, 14, 2] and in the random design scenario [15, 2]. When the coefficient vector $\theta^*$ is non-sparse (which is not the focus of these works), the latter papers prove a risk bound when the noise distribution admits at least sub-exponential tails.

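Remark 2.4 Theorem 2.1 expresses boundedness in terms of the $L^\infty$ diameter of the set of functions $\mathcal{F}$. Besides, (2.4) implies that the function $f^{(\mathrm{reg})} : x \mapsto \mathbb{E}[Y \mid X = x]$ satisfies $|f^{(\mathrm{reg})}(X) - f^*(X)| \le \sigma$ almost surely. By using Lemma 3.7 (page 23) instead of Lemma 3.6 (page 23), Theorem 2.1 still holds without assuming (2.3) and (2.4), when replacing $(2\sigma + H)^2$ with

$$V = \Bigg[ 2 \sqrt{ \sup_{f \in \mathcal{F}_{\mathrm{lin}} : \mathbb{E}[f(X)^2] = 1} \mathbb{E}\big( f(X)^2 [Y - f^*(X)]^2 \big) } + \sqrt{ \sup_{f', f'' \in \mathcal{F}} \mathbb{E}\big( [f'(X) - f''(X)]^2 \big) } \; \sqrt{ \sup_{f \in \mathcal{F}_{\mathrm{lin}} : \mathbb{E}[f(X)^2] = 1} \mathbb{E}[f(X)^4] } \Bigg]^2.$$

The quantity $V$ is finite when simultaneously, $\Theta$ is bounded, and for any $j$ in $\{1, \dots, d\}$, the quantities $\mathbb{E}[\varphi_j^4(X)]$ and $\mathbb{E}\{ \varphi_j(X)^2 [Y - f^*(X)]^2 \}$ are finite.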



3. A GENERIC LOCALIZED PAC-BAYES APPROACH 



3.1. Notation and setting. In this section, we drop the restrictions of the linear least squares setting considered so far in order to focus on the ideas underlying the estimator and the results presented in Section 2. To do this, we consider that the loss incurred by predicting $y'$ while the correct output is $y$ is $\ell(y, y')$ (and is not necessarily equal to $(y - y')^2$). The quality of a (prediction) function $f : \mathcal{X} \to \mathbb{R}$ is measured by its risk

$$R(f) = \mathbb{E}\{ \ell[Y, f(X)] \}.$$

We still consider the problem of predicting (at least) as well as the best function in a given set of functions $\mathcal{F}$ (but $\mathcal{F}$ is not necessarily a subset of a finite dimensional linear space). Let $f^*$ still denote a function minimizing the risk among functions in $\mathcal{F}$: $f^* \in \operatorname*{argmin}_{f \in \mathcal{F}} R(f)$. For simplicity, we assume that it exists. The excess risk is defined as

$$\bar R(f) = R(f) - R(f^*).$$

Let $\bar\ell : \mathcal{Z} \times \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ be a function such that $\bar\ell(Z, f, f')$ represents^2 how much worse $f$ predicts than $f'$ on the data $Z$. Let us introduce the real-valued random processes $L : (f, f') \mapsto \bar\ell(Z, f, f')$ and $L_i : (f, f') \mapsto \bar\ell(Z_i, f, f')$, where $Z, Z_1, \dots, Z_n$ denote i.i.d. random variables with distribution $P$.

^2 While the natural choice in the least squares setting is $\bar\ell((X, Y), f, f') = [Y - f(X)]^2 - [Y - f'(X)]^2$, we will see that for heavy-tailed outputs, it is preferable to consider the following soft-truncated version of it, up to a scaling factor $\lambda > 0$: $\bar\ell((X, Y), f, f') = T\big( \lambda \{ [Y - f(X)]^2 - [Y - f'(X)]^2 \} \big)$, with $T(x) = -\log(1 - x + x^2/2)$. Equality (3.4, page 16) corresponds to (2.1, page 12) with this choice of function $\bar\ell$ and for the choice $\pi^* = \pi$.

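Let $\pi$ and $\pi^*$ be two (prior) probability distributions on $\mathcal{F}$. We assume the following integrability condition.

Condition I. For any $f \in \mathcal{F}$, we have

$$\int \mathbb{E}\{ \exp[L(f, f')] \}^n \, \pi^*(df') < +\infty, \tag{3.1}$$

and

$$\int \frac{\pi(df)}{\int \mathbb{E}\{ \exp[L(f, f')] \}^n \, \pi^*(df')} < +\infty. \tag{3.2}$$

We consider the real-valued processes

$$\hat L(f, f') = \sum_{i=1}^n L_i(f, f'), \tag{3.3}$$

$$\hat{\mathcal{E}}(f) = \log \int \exp[\hat L(f, f')] \, \pi^*(df'), \tag{3.4}$$

$$L^\flat(f, f') = -n \log\big\{ \mathbb{E}[\exp(-L(f, f'))] \big\}, \tag{3.5}$$

$$L^\sharp(f, f') = n \log\big\{ \mathbb{E}[\exp(L(f, f'))] \big\}, \tag{3.6}$$

$$\text{and} \quad \mathcal{E}^\sharp(f) = \log\Big\{ \int \exp[L^\sharp(f, f')] \, \pi^*(df') \Big\}. \tag{3.7}$$

Essentially, the quantities $\hat L(f, f')$, $L^\flat(f, f')$ and $L^\sharp(f, f')$ represent how much worse the prediction from $f$ is than from $f'$ with respect to the training data or in expectation. By Jensen's inequality, we have

$$L^\flat \le n \mathbb{E}(L) = \mathbb{E}(\hat L) \le L^\sharp. \tag{3.8}$$

The quantities $\hat{\mathcal{E}}(f)$ and $\mathcal{E}^\sharp(f)$ should be understood as some kind of (empirical or expected) excess risk of the prediction function $f$ with respect to an implicit reference induced by the integral over $\mathcal{F}$.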

For a distribution $\rho$ on $\mathcal{F}$ absolutely continuous w.r.t. $\pi$, let $\frac{d\rho}{d\pi}$ denote the density of $\rho$ w.r.t. $\pi$. For any real-valued (measurable) function $h$ defined on $\mathcal{F}$ such that $\int \exp[h(f)] \, \pi(df) < +\infty$, we define the distribution $\pi_h$ on $\mathcal{F}$ by its density:

$$\frac{d\pi_h}{d\pi}(f) = \frac{\exp[h(f)]}{\int \exp[h(f')] \, \pi(df')}. \tag{3.9}$$

We will use the posterior distribution:

$$\frac{d\hat\pi}{d\pi}(f) = \frac{\exp[-\hat{\mathcal{E}}(f)]}{\int \exp[-\hat{\mathcal{E}}(f')] \, \pi(df')}. \tag{3.10}$$

Finally, for any $\beta \ge 0$, we will use the following measures of the size (or complexity) of $\mathcal{F}$ around the target function:

$$I^*(\beta) = -\log\Big\{ \int \exp[-\beta \bar R(f)] \, \pi^*(df) \Big\}$$

and

$$I(\beta) = -\log\Big\{ \int \exp[-\beta \bar R(f)] \, \pi(df) \Big\}.$$

3.2. The localized PAC-Bayes bound. With the notation introduced in the previous section, we have the following risk bound for any randomized estimator.

Theorem 3.1 Assume that $\pi$, $\pi^*$, $\mathcal{F}$ and $\bar\ell$ satisfy the integrability conditions (3.1) and (3.2, page 16). Let $\rho$ be a (posterior) probability distribution on $\mathcal{F}$ admitting a density with respect to $\pi$ depending on $Z_1, \dots, Z_n$. Let $\hat f$ be a prediction function drawn from the distribution $\rho$. Then for any $\gamma \ge 0$, $\gamma^* \ge 0$ and $\varepsilon > 0$, with probability (with respect to the distribution $P^{\otimes n} \rho$ generating the observations $Z_1, \dots, Z_n$ and the randomized prediction function $\hat f$) at least $1 - \varepsilon$:

$$\int \big[ L^\flat(\hat f, f) + \gamma^* \bar R(f) \big] \, \pi^*_{-\gamma^* \bar R}(df) - \gamma \bar R(\hat f) \le I^*(\gamma^*) - I(\gamma) - \log\Big\{ \int \exp[-\mathcal{E}^\sharp(f)] \, \pi(df) \Big\} + \log\Big[ \frac{d\rho}{d\hat\pi}(\hat f) \Big] + 2 \log(2 \varepsilon^{-1}). \tag{3.11}$$

Proof. See Section 4.2 (page 26). □

Some extra work will be needed to prove that Inequality (3.11) provides an upper bound on the excess risk $\bar R(\hat f)$ of the estimator $\hat f$. As we will see in the next sections, despite the $-\gamma \bar R(\hat f)$ term and provided that $\gamma$ is sufficiently small, the left-hand side will be essentially lower bounded by $\lambda n \bar R(\hat f)$, while, by choosing $\rho = \hat\pi$, the estimator does not appear in the right-hand side.

3.3. Application under an exponential moment condition. The estimator proposed in Section 2 and Theorem 3.1 seems rather unnatural (or at least complicated) at first sight. The goal of this section is twofold. First it shows that under exponential moment conditions (i.e., stronger assumptions than the ones in Theorem 2.1 when the linear least squares setting is considered), one can have a much simpler estimator than the one consisting in drawing a function according to the distribution (2.2) with $\hat{\mathcal{E}}$ given by (2.1) and yet still obtain a $d/n$ convergence rate. Secondly it illustrates Theorem 3.1 in a different and simpler way than the one we will use to prove Theorem 2.1.

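In this section, we consider the following variance and complexity assumptions.

Condition V1. There exist $\lambda > 0$ and $0 < \eta < 1$ such that for any function $f \in \mathcal{F}$, we have $E\{\exp\{\lambda \tilde\ell[Y, f(X)]\}\} < +\infty$,

$$\log\Big\{E\Big\{\exp\Big\{\lambda\big[\tilde\ell[Y, f(X)] - \tilde\ell[Y, f^*(X)]\big]\Big\}\Big\}\Big\} \le \lambda(1+\eta)[R(f) - R(f^*)],$$

and

$$\log\Big\{E\Big\{\exp\Big\{-\lambda\big[\tilde\ell[Y, f(X)] - \tilde\ell[Y, f^*(X)]\big]\Big\}\Big\}\Big\} \le -\lambda(1-\eta)[R(f) - R(f^*)].$$

Condition C. There exist a probability distribution $\pi$, and constants $D > 0$ and $G > 0$ such that for any $0 < \alpha < \beta$,

$$\log\bigg(\frac{\int \exp\{-\alpha[R(f) - R(f^*)]\}\,\pi(df)}{\int \exp\{-\beta[R(f) - R(f^*)]\}\,\pi(df)}\bigg) \le D \log\Big(\frac{G\beta}{\alpha}\Big).$$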

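Theorem 3.2 Assume that V1 and C are satisfied. Let $\hat\pi^{(\mathrm{Gibbs})}$ be the probability distribution on $\mathcal{F}$ defined by its density

$$\frac{d\hat\pi^{(\mathrm{Gibbs})}}{d\pi}(f) = \frac{\exp\{-\lambda \sum_{i=1}^n \tilde\ell[Y_i, f(X_i)]\}}{\int \exp\{-\lambda \sum_{i=1}^n \tilde\ell[Y_i, f'(X_i)]\}\,\pi(df')},$$

where $\lambda > 0$ and the distribution $\pi$ are those appearing respectively in V1 and C. Let $\hat f \in \mathcal{F}$ be a function drawn according to this Gibbs distribution. Then for any $\eta'$ such that $0 < \eta' < 1 - \eta$ (where $\eta$ is the constant appearing in V1) and any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, we have

$$R(\hat f) - R(f^*) \le \frac{C_1' D + C_2' \log(2\varepsilon^{-1})}{n}$$

with

$$C_1' = \frac{\log\big(\frac{G(1+\eta)}{\eta'}\big)}{\lambda(1-\eta-\eta')} \quad\text{and}\quad C_2' = \frac{2}{\lambda(1-\eta-\eta')}.$$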
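To make the Gibbs distribution of Theorem 3.2 concrete, here is a minimal Python sketch on a toy problem. It assumes a one-dimensional linear model $f_\theta(x) = \theta x$ with the squared loss, and it replaces the uniform prior $\pi$ by a uniform distribution on a finite grid of candidate parameters. The grid discretization and the function names `gibbs_weights` and `draw_gibbs` are illustrative assumptions, not part of the original text.

```python
import math
import random


def gibbs_weights(thetas, xs, ys, lam):
    """Normalized Gibbs weights over a finite grid of candidate parameters.

    Assumes f_theta(x) = theta * x, the squared loss, and a uniform prior
    on the grid `thetas` (a discretized stand-in for the uniform prior pi).
    """
    losses = [sum((y - th * x) ** 2 for x, y in zip(xs, ys)) for th in thetas]
    # Subtract the smallest cumulative loss before exponentiating; the
    # constant cancels in the normalization and avoids underflow.
    shift = min(losses)
    raw = [math.exp(-lam * (loss - shift)) for loss in losses]
    total = sum(raw)
    return [w / total for w in raw]


def draw_gibbs(thetas, weights, rng):
    """Draw one parameter from the (discretized) Gibbs distribution."""
    return rng.choices(thetas, weights=weights, k=1)[0]


if __name__ == "__main__":
    grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
    xs, ys = [1.0] * 4, [0.5] * 4
    w = gibbs_weights(grid, xs, ys, lam=1.0)
    print(draw_gibbs(grid, w, random.Random(0)))
```

Larger values of `lam` concentrate the weights around the empirical risk minimizer on the grid, while `lam` close to zero returns the prior.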

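Proof. We consider $\ell[(X, Y), f, f'] = \lambda\{\tilde\ell[Y, f(X)] - \tilde\ell[Y, f'(X)]\}$, where $\lambda$ is the constant appearing in the variance assumption. Let us take $\gamma^* = 0$ and let $\pi^*$ be the Dirac distribution at $f^*$: $\pi^*(\{f^*\}) = 1$. Then Condition V1 implies Condition I (page 16) and we can apply Theorem 3.1. Writing $\bar R(f) = R(f) - R(f^*)$, we have

$$L(f, f') = \lambda\{\tilde\ell[Y, f(X)] - \tilde\ell[Y, f'(X)]\},$$
$$\hat L(f, f') = \lambda \sum_{i=1}^n \{\tilde\ell[Y_i, f(X_i)] - \tilde\ell[Y_i, f'(X_i)]\},$$
$$\hat{\mathcal{E}}(f) = \lambda \sum_{i=1}^n \{\tilde\ell[Y_i, f(X_i)] - \tilde\ell[Y_i, f^*(X_i)]\},$$
$$\hat\pi = \hat\pi^{(\mathrm{Gibbs})},$$
$$L^\flat(f, f^*) = -n \log\{E[\exp[-L(f, f^*)]]\},$$
$$\mathcal{E}^\sharp(f) = n \log\{E[\exp[L(f, f^*)]]\},$$

and Assumption V1 leads to:

$$\log\{E[\exp[L(f, f^*)]]\} \le \lambda(1+\eta)\bar R(f)$$

and

$$\log\{E[\exp[-L(f, f^*)]]\} \le -\lambda(1-\eta)\bar R(f).$$

Thus choosing $\rho = \hat\pi$, (3.11) gives

$$[\lambda n(1-\eta) - \gamma]\bar R(\hat f) \le -\mathcal{I}(\gamma) + \mathcal{I}[\lambda n(1+\eta)] + 2\log(2\varepsilon^{-1}).$$

Accordingly, by the complexity assumption, for $\gamma \le \lambda n(1+\eta)$, we get

$$[\lambda n(1-\eta) - \gamma]\bar R(\hat f) \le D \log\Big(\frac{G\lambda n(1+\eta)}{\gamma}\Big) + 2\log(2\varepsilon^{-1}),$$

which implies the announced result by reparameterization (taking $\gamma = \lambda n \eta'$). □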
Let us conclude this section by mentioning settings in which assumptions V1 and C are satisfied.

Lemma 3.3 Let $\Theta$ be a bounded convex set of $\mathbb{R}^d$, and $\varphi_1, \dots, \varphi_d$ be $d$ square integrable prediction functions. Assume that

$$\mathcal{F} = \Big\{f_\theta = \sum_{j=1}^d \theta_j \varphi_j;\ (\theta_1, \dots, \theta_d) \in \Theta\Big\},$$

$\pi$ is the uniform distribution on $\mathcal{F}$ (i.e., the one coming from the uniform distribution on $\Theta$), and that there exist $0 < b_1 \le b_2$ such that for any $y \in \mathbb{R}$, the function $\tilde\ell_y : y' \mapsto \tilde\ell(y, y')$ admits almost everywhere a second derivative such that $(y, y') \mapsto \tilde\ell_y''(y')$ is measurable, for any $y, y' \in \mathbb{R}$, $b_1 \le \tilde\ell_y''(y') \le b_2$, and

$$\tilde\ell(y, y') = \tilde\ell(y, y) + (y' - y)\tilde\ell_y'(y) + \int_y^{y'} (y' - y'')\,\tilde\ell_y''(y'')\,dy''.$$

Then Condition C holds for the above uniform $\pi$, $G = \sqrt{b_2/b_1}$ and $D = d$.

Besides, when $f^* = f^*_{\mathrm{lin}}$ (i.e., $\min_{\mathcal{F}} R = \min_{\theta \in \mathbb{R}^d} R(f_\theta)$), Condition C holds for the above uniform $\pi$, $G = b_2/b_1$ and $D = d/2$.

Proof. See Section 4.3 (page 30). □

Remark 3.1 In particular, for the least squares loss $\tilde\ell(y, y') = (y - y')^2$, we have $b_1 = b_2 = 2$ so that Condition C holds with $\pi$ the uniform distribution on $\mathcal{F}$, $D = d$ and $G = 1$, and with $D = d/2$ and $G = 1$ when $f^* = f^*_{\mathrm{lin}}$.

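Lemma 3.4 Assume that the loss function $\tilde\ell$ satisfies the conditions stated in Lemma 3.3. Assume moreover that there exist $A > 0$ and $M > 0$ such that for any $x \in \mathcal{X}$,

$$E\Big\{\exp\Big[A^{-1}\big|\tilde\ell_Y'[f^*(X)]\big|\Big] \,\Big|\, X = x\Big\} \le M.$$

Assume that $\mathcal{F}$ is convex and has a diameter upper bounded by $H$ for the $L^\infty$-norm:

$$\sup_{f_1, f_2 \in \mathcal{F},\, x \in \mathcal{X}} |f_1(x) - f_2(x)| \le H.$$

In this case Condition V1 holds for any $(\lambda, \eta)$ such that

$$\eta \ge \frac{\lambda A^2}{2 b_1} \exp\big[M^2 \exp(H b_2/A)\big]$$

and $0 < \lambda \le (2AH)^{-1}$ is small enough to ensure $\eta < 1$.

Proof. See Section 4.4 (page 31). □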

3.4. Application without exponential moment condition. When we do not have finite exponential moments as assumed by Condition V1 (page 18), e.g., when $E\{\exp\{\lambda\{\tilde\ell[Y, f(X)] - \tilde\ell[Y, f^*(X)]\}\}\} = +\infty$ for any $\lambda > 0$ and some function $f$ in $\mathcal{F}$, we cannot apply Theorem 3.1 with $\ell[(X, Y), f, f'] = \lambda\{\tilde\ell[Y, f(X)] - \tilde\ell[Y, f'(X)]\}$ (because of the $\mathcal{E}^\sharp$ term). However, we can apply it to the soft truncated excess loss

$$\ell[(X, Y), f, f'] = T\big(\lambda\{\tilde\ell[Y, f(X)] - \tilde\ell[Y, f'(X)]\}\big),$$

with $T(x) = -\log(1 - x + x^2/2)$. This section provides a result similar to Theorem 3.2 in which Condition V1 is replaced by the following condition.

Condition V2. For any function $f$, the random variable $\tilde\ell[Y, f(X)] - \tilde\ell[Y, f^*(X)]$ is square integrable and there exists $V > 0$ such that for any function $f$,

$$E\Big\{\big[\tilde\ell[Y, f(X)] - \tilde\ell[Y, f^*(X)]\big]^2\Big\} \le V[R(f) - R(f^*)].$$

Theorem 3.5 Assume that Conditions V2 above and C (page 18) are satisfied. Let $0 < \lambda < V^{-1}$ and

$$\ell[(X, Y), f, f'] = T\big(\lambda\{\tilde\ell[Y, f(X)] - \tilde\ell[Y, f'(X)]\}\big), \tag{3.12}$$

with

$$T(x) = -\log(1 - x + x^2/2). \tag{3.13}$$

Let $\hat f \in \mathcal{F}$ be a function drawn according to the distribution $\hat\pi$ defined in (3.10, page 16) with $\hat{\mathcal{E}}$ defined in (3.4, page 16) and $\pi^* = \pi$ the distribution appearing in Condition C. Then for any $0 < \eta' < 1 - \lambda V$ and $\varepsilon > 0$, with probability at least $1 - \varepsilon$, we have

$$R(\hat f) - R(f^*) \le V\,\frac{C_1' D + C_2' \log(2\varepsilon^{-1})}{n}$$

with

$$C_1' = \frac{\log\big(\frac{G(1+\eta)^2}{\eta'(1-\eta)}\big)}{\eta(1-\eta-\eta')}, \quad C_2' = \frac{2}{\eta(1-\eta-\eta')} \quad\text{and}\quad \eta = \lambda V.$$

In particular, for $\lambda = 0.32\,V^{-1}$ and $\eta' = 0.18$, we get

$$R(\hat f) - R(f^*) \le V\,\frac{16.6\,D + 12.5 \log(2\sqrt{G}\,\varepsilon^{-1})}{n}.$$
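The numerical constants 16.6 and 12.5 can be checked directly. The short Python sketch below evaluates $C_1' = \log\big(G(1+\eta)^2/(\eta'(1-\eta))\big)/(\eta(1-\eta-\eta'))$ and $C_2' = 2/(\eta(1-\eta-\eta'))$, where the expression for $C_1'$ is the one read off from the last display of the proof of Theorem 3.5 (its numerator is not legible in the theorem statement, so it is an assumption here). The function name is hypothetical, and $G = 1$ is used as the default, which corresponds to the least squares case of Remark 3.1.

```python
import math


def theorem_3_5_constants(eta, eta_prime, G=1.0):
    """Return (C1', C2') for eta = lambda * V and a free parameter eta_prime.

    The formula for C1' is taken from the final inequality in the proof of
    Theorem 3.5 after setting gamma = lambda * n * eta_prime.
    """
    if not (0.0 < eta < 1.0 and 0.0 < eta_prime < 1.0 - eta):
        raise ValueError("need 0 < eta < 1 and 0 < eta_prime < 1 - eta")
    denom = eta * (1.0 - eta - eta_prime)
    c1 = math.log(G * (1.0 + eta) ** 2 / (eta_prime * (1.0 - eta))) / denom
    c2 = 2.0 / denom
    return c1, c2


if __name__ == "__main__":
    print(theorem_3_5_constants(0.32, 0.18))
```

With $\eta = 0.32$ and $\eta' = 0.18$ the denominator is $0.32 \times 0.5 = 0.16$, which is where the factor $12.5 = 2/0.16$ comes from.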



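Proof. We apply Theorem 3.1 for $\ell$ given by (3.12) and $\pi^* = \pi$. Let us define, for any $f, f' \in \mathcal{F}$, $W(f, f') = \lambda\{\tilde\ell[Y, f(X)] - \tilde\ell[Y, f'(X)]\}$, and write $\bar R(f) = R(f) - R(f^*)$. Since $\log u \le u - 1$ for any $u > 0$, we have

$$L^\flat = -n \log E(1 - W + W^2/2) \ge n\big(E(W) - E(W^2)/2\big).$$

Moreover, from Assumption V2,

$$\frac{E[W(f, f')^2]}{2} \le E[W(f, f^*)^2] + E[W(f', f^*)^2] \le \lambda^2 V \bar R(f) + \lambda^2 V \bar R(f'), \tag{3.14}$$

hence, by introducing $\eta = \lambda V$,

$$L^\flat(f, f') \ge \lambda n\big[\bar R(f) - \bar R(f') - \lambda V \bar R(f) - \lambda V \bar R(f')\big] = \lambda n\big[(1-\eta)\bar R(f) - (1+\eta)\bar R(f')\big]. \tag{3.15}$$

Noting that

$$\exp[T(u)] = \frac{1}{1 - u + u^2/2} = \frac{1 + u + u^2/2}{(1 + u^2/2)^2 - u^2} = \frac{1 + u + u^2/2}{1 + u^4/4} \le 1 + u + \frac{u^2}{2},$$

we see that

$$L^\sharp = n \log\{E[\exp[T(W)]]\} \le n\big[E(W) + E(W^2)/2\big].$$

Using (3.14) and still $\eta = \lambda V$, we get

$$L^\sharp(f, f') \le \lambda n\big[\bar R(f) - \bar R(f') + \eta \bar R(f) + \eta \bar R(f')\big] = \lambda n(1+\eta)\bar R(f) - \lambda n(1-\eta)\bar R(f'),$$

and

$$\mathcal{E}^\sharp(f) \le \lambda n(1+\eta)\bar R(f) - \mathcal{I}(\lambda n(1-\eta)). \tag{3.16}$$

Plugging (3.15) and (3.16) in (3.11) for $\rho = \hat\pi$, we obtain

$$[\lambda n(1-\eta) - \gamma]\bar R(\hat f) + [\gamma^* - \lambda n(1+\eta)] \int \bar R(f)\,\pi_{-\gamma^*\bar R}(df) \le \mathcal{I}(\gamma^*) - \mathcal{I}(\gamma) + \mathcal{I}(\lambda n(1+\eta)) - \mathcal{I}(\lambda n(1-\eta)) + 2\log(2\varepsilon^{-1}).$$

By the complexity assumption, choosing $\gamma^* = \lambda n(1+\eta)$ and $\gamma < \lambda n(1-\eta)$, we get

$$[\lambda n(1-\eta) - \gamma]\bar R(\hat f) \le D \log\Big(G\,\frac{\lambda n(1+\eta)^2}{\gamma(1-\eta)}\Big) + 2\log(2\varepsilon^{-1}),$$

hence the desired result by considering $\gamma = \lambda n \eta'$ with $\eta' < 1 - \eta$. □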
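The algebraic facts about the soft truncation $T(x) = -\log(1 - x + x^2/2)$ used in this proof are easy to check numerically. The following Python sketch (the helper names are illustrative) verifies that $\exp[T(u)]$ equals $(1 + u + u^2/2)/(1 + u^4/4)$, that it is bounded above by $1 + u + u^2/2$, and that $T$ never exceeds $\log 2$, since $1 - x + x^2/2 = ((1-x)^2 + 1)/2 \ge 1/2$.

```python
import math


def soft_truncation(x):
    """T(x) = -log(1 - x + x^2/2); the argument of the log is always >= 1/2."""
    return -math.log(1.0 - x + x * x / 2.0)


def upper_polynomial(u):
    """Second order polynomial 1 + u + u^2/2 that dominates exp(T(u))."""
    return 1.0 + u + u * u / 2.0


def rational_form(u):
    """Closed form (1 + u + u^2/2) / (1 + u^4/4) of exp(T(u))."""
    return upper_polynomial(u) / (1.0 + u ** 4 / 4.0)


if __name__ == "__main__":
    for u in (-3.0, -0.5, 0.0, 0.5, 1.0, 3.0):
        print(u, soft_truncation(u), math.exp(soft_truncation(u)), upper_polynomial(u))
```

For small $|u|$ the function behaves like the identity, while for large $|u|$ it grows only logarithmically (in absolute value), which is what makes the truncated losses have finite exponential moments.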

Remark 3.2 The estimator seems abnormally complicated at first sight. This remark aims at explaining why we were not able to consider a simpler estimator.

In Section 3.3, in which we consider the exponential moment condition V1, we took $\ell[(X, Y), f, f'] = \lambda\{\tilde\ell[Y, f(X)] - \tilde\ell[Y, f'(X)]\}$ and $\pi^*$ as the Dirac distribution at $f^*$. For these choices, one can easily check that $\hat\pi$ does not depend on $f^*$.

In the absence of an exponential moment condition, we cannot consider the function $\ell[(X, Y), f, f'] = \lambda\{\tilde\ell[Y, f(X)] - \tilde\ell[Y, f'(X)]\}$ but have instead to use a truncated version. The truncation function $T$ of Theorem 3.5 can be replaced by the simpler function $u \mapsto (u \vee -M) \wedge M$ for some appropriate constant $M > 0$, but this leads to a bound with worse constants, without really simplifying the algorithm. The precise choice $T(x) = -\log(1 - x + x^2/2)$ comes from the remarkable property: there exist second order polynomials $P^\flat$ and $P^\sharp$ such that $\frac{1}{P^\flat(u)} \le \exp[T(u)] \le P^\sharp(u)$ and $P^\flat(u)P^\sharp(u) \le 1 + O(u^4)$ for $u \to 0$, which are reasonable properties to ask in order to ensure that (3.8), and consequently (3.11), are tight.

Besides, if we take $\ell$ as in (3.12) with $T$ a truncation function and $\pi^*$ as the Dirac distribution at $f^*$, then $\hat\pi$ would depend on $f^*$, and is consequently not observable. This is the reason why we do not consider $\pi^*$ as the Dirac distribution at $f^*$, but $\pi^* = \pi$. This leads to the estimator considered in Theorems 3.5 and 2.1.

Remark 3.3 Theorem 3.5 still holds for the same randomized estimator in which (3.13, page 21) is replaced with

$$T(x) = \log(1 + x + x^2/2).$$

Condition V2 holds under weak assumptions as illustrated by the following lemma.

Lemma 3.6 Consider the least squares setting: $\tilde\ell(y, y') = (y - y')^2$. Assume that $\mathcal{F}$ is convex and has a diameter upper bounded by $H$ for the $L^\infty$-norm:

$$\sup_{f_1, f_2 \in \mathcal{F},\, x \in \mathcal{X}} |f_1(x) - f_2(x)| \le H$$

and that for some $\sigma > 0$, we have

$$\sup_{x \in \mathcal{X}} E\{[Y - f^*(X)]^2 \mid X = x\} \le \sigma^2 < +\infty. \tag{3.17}$$

Then Condition V2 holds for $V = (2\sigma + H)^2$.

Proof. See Section 4.5 (page 33). □
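As a worked illustration of how Lemma 3.6 combines with Theorem 3.5, consider the least squares setting with $\pi$ uniform on $\mathcal{F}$, so that Remark 3.1 gives $G = 1$ and $D = d$. The numerical values below ($\sigma = 1$, $H = 2$, $d = 5$, $n = 1000$, $\varepsilon = 0.05$) are hypothetical and chosen only to show the orders of magnitude.

```latex
\begin{align*}
V &= (2\sigma + H)^2 = (2 + 2)^2 = 16, \qquad \lambda = 0.32\,V^{-1} = 0.02,\\
R(\hat f) - R(f^*)
  &\le V\,\frac{16.6\,d + 12.5\log(2\varepsilon^{-1})}{n}
   = 16 \times \frac{16.6 \times 5 + 12.5 \log(40)}{1000}\\
  &\approx 16 \times \frac{83 + 46.1}{1000} \approx 2.07 .
\end{align*}
```

The bound scales as $d/n$ with no $\log n$ factor; the confidence level enters only through the additive term $12.5 \log(2\varepsilon^{-1})$.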

Lemma 3.7 Consider the least squares setting: $\tilde\ell(y, y') = (y - y')^2$. Assume that $\mathcal{F}$ (i.e., $\Theta$) is bounded, and that for any $j \in \{1, \dots, d\}$, $E[\varphi_j(X)^4] < +\infty$ and $E\{\varphi_j(X)^2[Y - f^*(X)]^2\} < +\infty$. Then Condition V2 holds for

$$V = \bigg[2\sqrt{\sup_{f \in \mathcal{F}_{\mathrm{lin}}:\, E[f(X)^2] = 1} E\big(f(X)^2[Y - f^*(X)]^2\big)} + \sqrt{\sup_{f', f'' \in \mathcal{F}} E\big([f'(X) - f''(X)]^2\big)}\ \sqrt{\sup_{f \in \mathcal{F}_{\mathrm{lin}}:\, E[f(X)^2] = 1} E[f(X)^4]}\,\bigg]^2.$$

Proof. See Section 4.6 (page 33). □

4. Proofs

4.1. Main ideas of the proofs. The goal of this section is to explain the key ingredients appearing in the proofs, which allow us both to obtain sub-exponential tails for the excess risk under a non-exponential moment assumption and to get rid of the logarithmic factor in the excess risk bound.

4.1.1. Sub-exponential tails under a non-exponential moment assumption via truncation. Let us start with the idea allowing us to prove exponential inequalities under just a moment assumption (instead of the traditional exponential moment assumption). To understand it, we can consider the (apparently) simplistic 1-dimensional situation in which we have $\Theta = \mathbb{R}$ and the marginal distribution of $\varphi_1(X)$ is the Dirac distribution at 1. In this case, the risk of the prediction function $f_\theta$ is $R(f_\theta) = E[(Y - \theta)^2] = E[(Y - EY)^2] + (EY - \theta)^2$, so that the least squares regression problem boils down to the estimation of the mean of the output variable. If we only assume that $Y$ admits a finite second moment, say $E(Y^2) \le 1$, it is not clear whether for any $\varepsilon > 0$, it is possible to find $\hat\theta$ such that with probability at least $1 - 2\varepsilon$,

$$R(f_{\hat\theta}) - R(f^*) = (E(Y) - \hat\theta)^2 \le c\,\frac{\log(\varepsilon^{-1})}{n}, \tag{4.1}$$

for some numerical constant $c$. Indeed, from Chebyshev's inequality, the trivial choice $\hat\theta = \frac{1}{n}\sum_{i=1}^n Y_i$ just satisfies: with probability at least $1 - 2\varepsilon$,

$$R(f_{\hat\theta}) - R(f^*) \le \frac{1}{2n\varepsilon},$$

which is far from the objective (4.1) for small confidence levels (consider $\varepsilon = \exp(-\sqrt{n})$ for instance). The key idea is thus to average (soft) truncated values of the outputs. This is performed by taking

$$\hat\theta = \frac{1}{n\lambda}\sum_{i=1}^n \log\Big(1 + \lambda Y_i + \frac{\lambda^2 Y_i^2}{2}\Big),$$

with $\lambda = \sqrt{\frac{2\log(\varepsilon^{-1})}{n}}$ (this mean estimator thus depends on the confidence level parameter $\varepsilon$). Since we have

$$\log E \exp(n\lambda\hat\theta) = n \log\Big(1 + \lambda E(Y) + \frac{\lambda^2}{2}E(Y^2)\Big) \le n\lambda E(Y) + n\frac{\lambda^2}{2},$$

the exponential Chebyshev's inequality (see Lemma 4.1) guarantees that with probability at least $1 - \varepsilon$, we have $n\lambda(\hat\theta - E(Y)) \le n\frac{\lambda^2}{2} + \log(\varepsilon^{-1})$, hence

$$\hat\theta - E(Y) \le \sqrt{\frac{2\log(\varepsilon^{-1})}{n}}.$$

Replacing $Y$ by $-Y$ in the previous argument, we obtain that with probability at least $1 - \varepsilon$, we have

$$n\lambda\Big\{E(Y) + \frac{1}{n\lambda}\sum_{i=1}^n \log\Big(1 - \lambda Y_i + \frac{\lambda^2 Y_i^2}{2}\Big)\Big\} \le n\frac{\lambda^2}{2} + \log(\varepsilon^{-1}).$$

Since $-\log(1 + x + x^2/2) \le \log(1 - x + x^2/2)$, this implies

$$E(Y) - \hat\theta \le \sqrt{\frac{2\log(\varepsilon^{-1})}{n}}.$$

The two previous inequalities imply Inequality (4.1) (for $c = 2$), showing that sub-exponential tails are achievable even when we only assume that the random variable admits a finite second moment (see [12] for more details on the robust estimation of the mean of a random variable).
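The soft-truncated mean estimator of this section fits in a few lines of Python. The sketch below follows the formula $\hat\theta = \frac{1}{n\lambda}\sum_i \log(1 + \lambda Y_i + \lambda^2 Y_i^2/2)$ with $\lambda = \sqrt{2\log(\varepsilon^{-1})/n}$; the function names are illustrative, and the sample with one gross outlier in the usage part is a made-up example, not data from the paper.

```python
import math


def truncation_level(n, eps):
    """lambda = sqrt(2 log(1/eps) / n); also the deviation radius in (4.1) with c = 2."""
    return math.sqrt(2.0 * math.log(1.0 / eps) / n)


def soft_truncated_mean(ys, eps):
    """Average of soft-truncated outputs, assuming E(Y^2) <= 1 as in the text."""
    n = len(ys)
    lam = truncation_level(n, eps)
    # 1 + x + x^2/2 = ((1 + x)^2 + 1) / 2 >= 1/2, so the logarithm is well defined.
    total = sum(math.log(1.0 + lam * y + (lam * y) ** 2 / 2.0) for y in ys)
    return total / (n * lam)


if __name__ == "__main__":
    sample = [0.0] * 99 + [1.0e6]  # one gross outlier
    print("empirical mean:", sum(sample) / len(sample))
    print("soft-truncated mean:", soft_truncated_mean(sample, eps=0.05))
```

A single huge observation moves the empirical mean by its value divided by $n$, whereas its contribution to the soft-truncated mean grows only logarithmically, which is the mechanism behind the sub-exponential deviations.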

4.1.2. Localized PAC-Bayesian inequalities to eliminate a logarithm factor. The analysis of statistical inference generally relies on upper bounding the supremum of an empirical process $\chi$ indexed by the functions in a model $\mathcal{F}$. Concentration inequalities are a central tool for obtaining these bounds. An alternative approach, called the PAC-Bayesian one, consists in using the entropic equality

$$E \exp\Big(\sup_{\rho \in \mathcal{M}}\Big\{\int \rho(df)\chi(f) - K(\rho, \pi')\Big\}\Big) = \int \pi'(df)\,E \exp(\chi(f)), \tag{4.2}$$

where $\mathcal{M}$ is the set of probability distributions on $\mathcal{F}$ and $K(\rho, \pi')$ is the Kullback-Leibler divergence (whose definition is recalled in (4.4, page 29)) between $\rho$ and some fixed distribution $\pi'$.

Let $\hat r : \mathcal{F} \to \mathbb{R}$ be an observable process such that for any $f \in \mathcal{F}$, we have

$$E \exp[\chi(f)] \le 1$$

for $\chi(f) = \lambda[R(f) - \hat r(f)]$ and some $\lambda > 0$. Then, as a consequence of (4.2), for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, for any distribution $\rho$ on $\mathcal{F}$,

$$\int \rho(df)R(f) \le \int \rho(df)\hat r(f) + \frac{K(\rho, \pi') + \log(\varepsilon^{-1})}{\lambda}. \tag{4.3}$$

The left-hand side quantity represents the expected risk with respect to the distribution $\rho$. The question is now how to use (4.3) to design a posterior distribution $\rho$ for which $\int \rho(df)R(f)$ is guaranteed to be small. The constraint on the choice of $(\rho, \pi')$ is that $\rho$ should be computable from the data (e.g., it cannot depend on $R$) and $\pi'$ should not depend on the data: it may depend on $R$ (in contrast with Bayesian prior distributions!) but not on $\hat r$. Simple choices like $(\rho, \pi') = (\delta_{f^*}, \delta_{f^*})$ or $(\rho, \pi') = (\delta_{\hat f^{(\mathrm{erm})}}, \delta_{\hat f^{(\mathrm{erm})}})$ for $\hat f^{(\mathrm{erm})} \in \mathrm{argmin}_{f \in \mathcal{F}}\,\hat r(f)$, where $\delta_f$ denotes the Dirac distribution at the function $f$, are thus forbidden (while they would have led to a small right-hand side of (4.3)).

For fixed $\pi'$, the posterior distribution minimizing the right-hand side of (4.3) is $\rho = \pi'_{-\lambda\hat r}$. It is computable from the data if $\pi'$ is. Without prior knowledge, this would lead to take a "flat" distribution for $\pi'$ (e.g., the one induced by the Lebesgue measure in the case of a model $\mathcal{F}$ defined by a bounded parameter set in some Euclidean space). The resulting Kullback-Leibler divergence might be very large as it compares a distribution with a sharp peak (concentrated on functions $f \in \mathcal{F}$ for which $\hat r(f)$ is small) with a flat one.

To get a smaller Kullback-Leibler divergence, we can take posterior and prior distributions which are peaked around almost the same function. This can be done by taking $\pi'$ and $\rho$ respectively concentrated around $f^*$ and $\hat f^{(\mathrm{erm})}$. More precisely, one can take posterior distributions of the form $\rho = \pi_{-\lambda\hat r}$ for some $\lambda > 0$ and a "flat" distribution $\pi$ computable without knowing either the distribution $P$ generating the data or the training data (in particular, $\pi$ must not depend on $R$ or $\hat r$), and a "localized" prior distribution $\pi' = \pi_{-\beta R}$ for some $\beta > 0$. The parameters $\lambda$ and $\beta$ controlling the sharpness of the peaks at $\mathrm{argmin}_{f \in \mathcal{F}} R(f)$ and $\mathrm{argmin}_{f \in \mathcal{F}}\,\hat r(f)$ should be taken such that the peaks overlap (to ensure that the Kullback-Leibler divergence is small) and are at the same time sharp enough (to ensure that $\int \rho(df)\hat r(f)$ is small). The use of the "localized" prior distribution $\pi' = \pi_{-\beta R}$ implies an additional technical difficulty as one needs to control the divergence $K(\rho, \pi_{-\beta R})$. This is achieved by writing

$$K(\rho, \pi_{-\beta R}) = K(\rho, \pi) + \log\Big\{\int \exp[-\beta R(f)]\,\pi(df)\Big\} + \beta \int R(f)\,\rho(df),$$

and controlling the new logarithmic term through PAC-Bayesian inequalities.

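4.2. Proof of Theorem 3.1. We use the standard way of obtaining PAC bounds through upper bounds on Laplace transforms of appropriate random variables. This argument is synthesized in the following result.

Lemma 4.1 For any $\varepsilon > 0$ and any real-valued random variable $V$ such that $E[\exp(V)] \le 1$, with probability at least $1 - \varepsilon$, we have

$$V \le \log(\varepsilon^{-1}).$$

Let

$$V_1(\hat f) = \int \big[L^\flat(\hat f, f) + \gamma^*\bar R(f)\big]\,\pi^*_{-\gamma^*\bar R}(df) - \gamma\bar R(\hat f) - \mathcal{I}^*(\gamma^*) + \mathcal{I}(\gamma) + \log\Big\{\int \exp[-\hat{\mathcal{E}}(f)]\,\pi(df)\Big\} - \log\Big[\frac{d\rho}{d\hat\pi}(\hat f)\Big]$$

and

$$V_2 = -\log\Big\{\int \exp[-\hat{\mathcal{E}}(f)]\,\pi(df)\Big\} + \log\Big\{\int \exp[-\mathcal{E}^\sharp(f)]\,\pi(df)\Big\}.$$

To prove the theorem, according to Lemma 4.1, it suffices to prove that

$$E\Big\{\int \exp[V_1(\hat f)]\,\rho(d\hat f)\Big\} \le 1 \quad\text{and}\quad E\Big\{\int \exp(V_2)\,\rho(d\hat f)\Big\} \le 1.$$

These two inequalities are proved in the following two sections.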
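4.2.1. Proof of $E\{\int \exp[V_1(\hat f)]\,\rho(d\hat f)\} \le 1$. From Jensen's inequality, we have

$$\int \big[L^\flat(\hat f, f) + \gamma^*\bar R(f)\big]\,\pi^*_{-\gamma^*\bar R}(df) = \int \big[\hat L(\hat f, f) + \gamma^*\bar R(f)\big]\,\pi^*_{-\gamma^*\bar R}(df) + \int \big[L^\flat(\hat f, f) - \hat L(\hat f, f)\big]\,\pi^*_{-\gamma^*\bar R}(df)$$
$$\le \int \big[\hat L(\hat f, f) + \gamma^*\bar R(f)\big]\,\pi^*_{-\gamma^*\bar R}(df) + \log \int \exp\big[L^\flat(\hat f, f) - \hat L(\hat f, f)\big]\,\pi^*_{-\gamma^*\bar R}(df).$$

From Jensen's inequality again,

$$-\hat{\mathcal{E}}(\hat f) = -\log \int \exp[\hat L(\hat f, f)]\,\pi^*(df) = -\log \int \exp\big[\hat L(\hat f, f) + \gamma^*\bar R(f)\big]\,\pi^*_{-\gamma^*\bar R}(df) - \log \int \exp[-\gamma^*\bar R(f)]\,\pi^*(df)$$
$$\le -\int \big[\hat L(\hat f, f) + \gamma^*\bar R(f)\big]\,\pi^*_{-\gamma^*\bar R}(df) + \mathcal{I}^*(\gamma^*).$$

From the two previous inequalities, we get

$$V_1(\hat f) \le \int \big[\hat L(\hat f, f) + \gamma^*\bar R(f)\big]\,\pi^*_{-\gamma^*\bar R}(df) + \log \int \exp\big[L^\flat(\hat f, f) - \hat L(\hat f, f)\big]\,\pi^*_{-\gamma^*\bar R}(df) - \gamma\bar R(\hat f) - \mathcal{I}^*(\gamma^*) + \mathcal{I}(\gamma) + \log\Big(\int \exp[-\hat{\mathcal{E}}(f)]\,\pi(df)\Big) - \log\Big[\frac{d\rho}{d\hat\pi}(\hat f)\Big]$$
$$= \int \big[\hat L(\hat f, f) + \gamma^*\bar R(f)\big]\,\pi^*_{-\gamma^*\bar R}(df) + \log \int \exp\big[L^\flat(\hat f, f) - \hat L(\hat f, f)\big]\,\pi^*_{-\gamma^*\bar R}(df) - \gamma\bar R(\hat f) - \mathcal{I}^*(\gamma^*) + \mathcal{I}(\gamma) - \hat{\mathcal{E}}(\hat f) - \log\Big[\frac{d\rho}{d\pi}(\hat f)\Big]$$
$$\le \log \int \exp\big[L^\flat(\hat f, f) - \hat L(\hat f, f)\big]\,\pi^*_{-\gamma^*\bar R}(df) - \gamma\bar R(\hat f) + \mathcal{I}(\gamma) - \log\Big[\frac{d\rho}{d\pi}(\hat f)\Big]$$
$$= \log \int \exp\big[L^\flat(\hat f, f) - \hat L(\hat f, f)\big]\,\pi^*_{-\gamma^*\bar R}(df) + \log\Big[\frac{d\pi_{-\gamma\bar R}}{d\rho}(\hat f)\Big],$$

hence, by using Fubini's theorem and the equality

$$E\{\exp[-\hat L(\hat f, f)]\} = \exp[-L^\flat(\hat f, f)],$$

we obtain

$$E \int \exp[V_1(\hat f)]\,\rho(d\hat f) \le E \int \Big(\int \exp\big[L^\flat(\hat f, f) - \hat L(\hat f, f)\big]\,\pi^*_{-\gamma^*\bar R}(df)\Big)\,\pi_{-\gamma\bar R}(d\hat f)$$
$$= \int \Big(\int E \exp\big[L^\flat(\hat f, f) - \hat L(\hat f, f)\big]\,\pi^*_{-\gamma^*\bar R}(df)\Big)\,\pi_{-\gamma\bar R}(d\hat f) = 1.$$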



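4.2.2. Proof of $E\{\int \exp(V_2)\,\rho(d\hat f)\} \le 1$. It relies on the following result.

Lemma 4.2 Let $W$ be a real-valued measurable function defined on a product space $\mathcal{A}_1 \times \mathcal{A}_2$ and let $\mu_1$ and $\mu_2$ be probability distributions on respectively $\mathcal{A}_1$ and $\mathcal{A}_2$.

• If $E_{a_1 \sim \mu_1}\{\log[E_{a_2 \sim \mu_2}\{\exp[-W(a_1, a_2)]\}]\} < +\infty$, then we have

$$-E_{a_1 \sim \mu_1}\Big\{\log\Big[E_{a_2 \sim \mu_2}\big\{\exp[-W(a_1, a_2)]\big\}\Big]\Big\} \le -\log\Big\{E_{a_2 \sim \mu_2}\Big[\exp\big[-E_{a_1 \sim \mu_1} W(a_1, a_2)\big]\Big]\Big\}.$$

• If $W > 0$ on $\mathcal{A}_1 \times \mathcal{A}_2$ and $E_{a_2 \sim \mu_2}\{E_{a_1 \sim \mu_1}[W(a_1, a_2)]^{-1}\}^{-1} < +\infty$, then

$$E_{a_1 \sim \mu_1}\Big\{E_{a_2 \sim \mu_2}\big[W(a_1, a_2)^{-1}\big]^{-1}\Big\} \le E_{a_2 \sim \mu_2}\Big\{E_{a_1 \sim \mu_1}\big[W(a_1, a_2)\big]^{-1}\Big\}^{-1}.$$

Proof.

• Let $\mathcal{A}$ be a measurable space and $\mathcal{M}$ denote the set of probability distributions on $\mathcal{A}$. The Kullback-Leibler divergence between a distribution $\rho$ and a distribution $\mu$ is

$$K(\rho, \mu) = \begin{cases} E_{a \sim \rho} \log\big[\frac{d\rho}{d\mu}(a)\big] & \text{if } \rho \ll \mu,\\ +\infty & \text{otherwise,}\end{cases} \tag{4.4}$$

where $\frac{d\rho}{d\mu}$ denotes as usual the density of $\rho$ w.r.t. $\mu$. The Kullback-Leibler divergence satisfies the duality formula (see, e.g., [10, page 159]): for any real-valued measurable function $h$ defined on $\mathcal{A}$,

$$\inf_{\rho \in \mathcal{M}}\{E_{a \sim \rho} h(a) + K(\rho, \mu)\} = -\log E_{a \sim \mu}\{\exp[-h(a)]\}. \tag{4.5}$$

By using twice (4.5) and Fubini's theorem, we have

$$-E_{a_1 \sim \mu_1}\Big\{\log\Big[E_{a_2 \sim \mu_2}\big\{\exp[-W(a_1, a_2)]\big\}\Big]\Big\} = E_{a_1 \sim \mu_1}\Big\{\inf_\rho\big\{E_{a_2 \sim \rho}[W(a_1, a_2)] + K(\rho, \mu_2)\big\}\Big\}$$
$$\le \inf_\rho E_{a_1 \sim \mu_1}\Big\{E_{a_2 \sim \rho}[W(a_1, a_2)] + K(\rho, \mu_2)\Big\} = -\log E_{a_2 \sim \mu_2}\Big[\exp\big\{-E_{a_1 \sim \mu_1}[W(a_1, a_2)]\big\}\Big].$$

• By using twice (4.5) and the first assertion of Lemma 4.2, we have

$$E_{a_1 \sim \mu_1}\Big\{E_{a_2 \sim \mu_2}\big[W(a_1, a_2)^{-1}\big]^{-1}\Big\} = E_{a_1 \sim \mu_1}\Big\{\exp\Big\{-\log\Big[E_{a_2 \sim \mu_2}\big\{\exp[-\log W(a_1, a_2)]\big\}\Big]\Big\}\Big\}$$
$$= E_{a_1 \sim \mu_1}\Big\{\exp\Big\{\inf_\rho\big[E_{a_2 \sim \rho}\{\log[W(a_1, a_2)]\} + K(\rho, \mu_2)\big]\Big\}\Big\}$$
$$\le \inf_\rho\Big\{\exp[K(\rho, \mu_2)]\,E_{a_1 \sim \mu_1}\Big\{\exp\big\{E_{a_2 \sim \rho}\log[W(a_1, a_2)]\big\}\Big\}\Big\}$$
$$\le \inf_\rho\Big\{\exp[K(\rho, \mu_2)]\,\exp\Big\{E_{a_2 \sim \rho}\big\{\log E_{a_1 \sim \mu_1}[W(a_1, a_2)]\big\}\Big\}\Big\}$$
$$= \exp\Big\{\inf_\rho\Big\{E_{a_2 \sim \rho}\log\big\{E_{a_1 \sim \mu_1}[W(a_1, a_2)]\big\} + K(\rho, \mu_2)\Big\}\Big\}$$
$$= \exp\Big\{-\log\Big\{E_{a_2 \sim \mu_2}\exp\big[-\log\{E_{a_1 \sim \mu_1}[W(a_1, a_2)]\}\big]\Big\}\Big\} = E_{a_2 \sim \mu_2}\Big\{E_{a_1 \sim \mu_1}\big[W(a_1, a_2)\big]^{-1}\Big\}^{-1}.$$

□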
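The duality formula (4.5) drives both assertions of Lemma 4.2, and it can be checked numerically on a finite space. The Python sketch below (with illustrative function names and made-up numbers) computes the objective $E_{a\sim\rho}h(a) + K(\rho, \mu)$ for a few distributions $\rho$ and compares it with $-\log E_{a\sim\mu}\exp[-h(a)]$; the infimum is attained at the Gibbs distribution $\rho \propto \mu\,e^{-h}$, which is a standard fact assumed here rather than taken from the text.

```python
import math


def kl(rho, mu):
    """Kullback-Leibler divergence K(rho, mu) on a finite space (mu > 0 everywhere)."""
    return sum(r * math.log(r / m) for r, m in zip(rho, mu) if r > 0.0)


def objective(rho, mu, h):
    """E_{a ~ rho} h(a) + K(rho, mu), the quantity minimized in (4.5)."""
    return sum(r * hi for r, hi in zip(rho, h)) + kl(rho, mu)


def dual_value(mu, h):
    """-log E_{a ~ mu} exp[-h(a)], the right-hand side of (4.5)."""
    return -math.log(sum(m * math.exp(-hi) for m, hi in zip(mu, h)))


def gibbs(mu, h):
    """Distribution with density proportional to exp(-h) with respect to mu."""
    raw = [m * math.exp(-hi) for m, hi in zip(mu, h)]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    mu = [0.2, 0.3, 0.5]
    h = [1.0, -0.5, 2.0]
    print(objective(gibbs(mu, h), mu, h), dual_value(mu, h))
```

Any other choice of `rho` gives a larger value of `objective`, which is the inequality used when the infimum is exchanged with an expectation in the proof.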



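From Lemma 4.2 and Fubini's theorem, since $V_2$ does not depend on $\hat f$, we have

$$E\Big[\int \exp(V_2)\,\rho(d\hat f)\Big] = E[\exp(V_2)]$$
$$= \int \exp[-\mathcal{E}^\sharp(f)]\,\pi(df)\ E\Big\{\Big[\int \exp[-\hat{\mathcal{E}}(f)]\,\pi(df)\Big]^{-1}\Big\}$$
$$\le \int \exp[-\mathcal{E}^\sharp(f)]\,\pi(df)\ \Big\{\int E\big[\exp(\hat{\mathcal{E}}(f))\big]^{-1}\pi(df)\Big\}^{-1}$$
$$= \int \exp[-\mathcal{E}^\sharp(f)]\,\pi(df)\ \Big\{\int E\Big[\int \exp[\hat L(f, f')]\,\pi^*(df')\Big]^{-1}\pi(df)\Big\}^{-1}$$
$$= \int \exp[-\mathcal{E}^\sharp(f)]\,\pi(df)\ \Big\{\int \Big[\int \exp[L^\sharp(f, f')]\,\pi^*(df')\Big]^{-1}\pi(df)\Big\}^{-1}$$
$$= 1.$$

This concludes the proof that for any $\gamma \ge 0$, $\gamma^* \ge 0$ and $\varepsilon > 0$, with probability (with respect to the distribution $P^{\otimes n}\rho$ generating the observations $Z_1, \dots, Z_n$ and the randomized prediction function $\hat f$) at least $1 - 2\varepsilon$:

$$V_1(\hat f) + V_2 \le 2\log(\varepsilon^{-1}).$$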

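4.3. Proof of Lemma 3.3. Let us look at $\mathcal{F}$ from the point of view of $f^*$. Precisely, let $S_{\mathbb{R}^d}(0, 1)$ be the sphere of $\mathbb{R}^d$ centered at the origin and with radius 1 and

$$\mathbb{S} = \Big\{\sum_{j=1}^d \theta_j\varphi_j;\ (\theta_1, \dots, \theta_d) \in S_{\mathbb{R}^d}(0, 1)\Big\}.$$

Introduce

$$\Omega = \{\phi \in \mathbb{S};\ \exists u > 0 \text{ s.t. } f^* + u\phi \in \mathcal{F}\}.$$

For any $\phi \in \Omega$, let $u_\phi = \sup\{u > 0 : f^* + u\phi \in \mathcal{F}\}$. Since $\pi$ is the uniform distribution on the convex set $\mathcal{F}$ (i.e., the one coming from the uniform distribution on $\Theta$), we have

$$\int \exp\{-\alpha[R(f) - R(f^*)]\}\,\pi(df) \propto \int_{\phi \in \Omega}\int_0^{u_\phi} \exp\{-\alpha[R(f^* + u\phi) - R(f^*)]\}\,u^{d-1}\,du\,d\phi.$$

Let $c_\phi = E[\phi(X)\,\tilde\ell_Y'(f^*(X))]$ and $a_\phi = E[\phi^2(X)]$. Since

$$f^* \in \mathrm{argmin}_{f \in \mathcal{F}}\,E\{\tilde\ell_Y[f(X)]\},$$

we have $c_\phi \ge 0$ (and $c_\phi = 0$ if both $-\phi$ and $\phi$ belong to $\Omega$). Moreover, from Taylor's expansion,

$$\frac{b_1 a_\phi u^2}{2} \le R(f^* + u\phi) - R(f^*) - u c_\phi \le \frac{b_2 a_\phi u^2}{2}.$$

Introduce

$$\psi_\phi = \frac{\int_0^{u_\phi} \exp\{-\alpha[u c_\phi + \frac{1}{2} b_1 a_\phi u^2]\}\,u^{d-1}\,du}{\int_0^{u_\phi} \exp\{-\beta[u c_\phi + \frac{1}{2} b_2 a_\phi u^2]\}\,u^{d-1}\,du}.$$

For any $0 < \alpha < \beta$, we have

$$\frac{\int \exp\{-\alpha[R(f) - R(f^*)]\}\,\pi(df)}{\int \exp\{-\beta[R(f) - R(f^*)]\}\,\pi(df)} \le \sup_{\phi \in \Omega}\psi_\phi.$$

For any $\zeta > 1$, by a change of variable,

$$\psi_\phi \le \zeta^d\,\frac{\int_0^{u_\phi} \exp\{-\alpha[\zeta u c_\phi + \frac{1}{2} b_1 a_\phi \zeta^2 u^2]\}\,u^{d-1}\,du}{\int_0^{u_\phi} \exp\{-\beta[u c_\phi + \frac{1}{2} b_2 a_\phi u^2]\}\,u^{d-1}\,du} \le \zeta^d \sup_{u > 0} \exp\Big\{\beta\big[u c_\phi + \tfrac{1}{2} b_2 a_\phi u^2\big] - \alpha\big[\zeta u c_\phi + \tfrac{1}{2} b_1 a_\phi \zeta^2 u^2\big]\Big\}.$$

Taking $\zeta = \sqrt{(b_2\beta)/(b_1\alpha)}$ when $c_\phi = 0$ and $\zeta = \sqrt{(b_2\beta)/(b_1\alpha)} \vee (\beta/\alpha)$ otherwise, we obtain $\psi_\phi \le \zeta^d$, hence

$$\log\bigg(\frac{\int \exp\{-\alpha[R(f) - R(f^*)]\}\,\pi(df)}{\int \exp\{-\beta[R(f) - R(f^*)]\}\,\pi(df)}\bigg) \le \begin{cases}\frac{d}{2}\log\big(\frac{b_2\beta}{b_1\alpha}\big) & \text{when } \sup_{\phi \in \Omega} c_\phi = 0,\\[4pt] d\log\Big(\sqrt{\frac{b_2\beta}{b_1\alpha}} \vee \frac{\beta}{\alpha}\Big) & \text{otherwise,}\end{cases}$$

which proves the announced result.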



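4.4. Proof of Lemma 3.4. For $-(2AH)^{-1} \le \lambda \le (2AH)^{-1}$, introduce the random variables

$$F = f(X), \qquad F^* = f^*(X),$$
$$\Omega = \tilde\ell_Y'(F^*) + (F - F^*)\int_0^1 (1 - t)\,\tilde\ell_Y''(F^* + t(F - F^*))\,dt,$$
$$L = \lambda[\tilde\ell(Y, F) - \tilde\ell(Y, F^*)],$$

and the quantities

$$a(\lambda) = \frac{M^2A^2\exp(Hb_2/A)}{2\sqrt{\pi}(1 - |\lambda|AH)}$$

and

$$\tilde A = Hb_2/2 + A\log(M) = A\log\{M\exp[Hb_2/(2A)]\}.$$

From Taylor-Lagrange formula, we have

$$L = \lambda(F - F^*)\,\Omega.$$

Since $E[\exp(|\Omega|/A) \mid X] \le M\exp[Hb_2/(2A)]$, Lemma D.2 gives

$$\log\Big\{E\Big[\exp\big\{\alpha[\Omega - E(\Omega|X)]/A\big\} \,\Big|\, X\Big]\Big\} \le \frac{M^2\alpha^2\exp(Hb_2/A)}{2\sqrt{\pi}(1 - |\alpha|)}$$

for any $-1 < \alpha < 1$, and

$$|E(\Omega|X)| \le \tilde A. \tag{4.6}$$

By considering $\alpha = A\lambda[f(x) - f^*(x)] \in [-1/2; 1/2]$ for fixed $x \in \mathcal{X}$, we get

$$\log\Big\{E\Big[\exp[L - E(L|X)] \,\Big|\, X\Big]\Big\} \le \lambda^2(F - F^*)^2\,a(\lambda). \tag{4.7}$$

Let us put moreover

$$\tilde L = E(L|X) + a(\lambda)\lambda^2(F - F^*)^2.$$

Since $-(2AH)^{-1} \le \lambda \le (2AH)^{-1}$, we have $\tilde L \le |\lambda|H\tilde A + a(\lambda)\lambda^2H^2 \le b'$ with $b' = \tilde A/(2A) + M^2\exp(Hb_2/A)/(4\sqrt{\pi})$. Since $L - E(L) = L - E(L|X) + E(L|X) - E(L)$, by using Lemma D.1, (4.7) and (4.6), we obtain

$$\log\big\{E[\exp[L - E(L)]]\big\} \le \log\big\{E[\exp[\tilde L - E(\tilde L)]]\big\} + \lambda^2a(\lambda)E[(F - F^*)^2]$$
$$\le E(\tilde L^2)\,g(b') + \lambda^2a(\lambda)E[(F - F^*)^2]$$
$$\le \lambda^2E[(F - F^*)^2]\,\big[\tilde A^2g(b') + a(\lambda)\big],$$

with $g(u) = [\exp(u) - 1 - u]/u^2$. Computations show that for any $-(2AH)^{-1} \le \lambda \le (2AH)^{-1}$,

$$\tilde A^2g(b') + a(\lambda) \le \frac{A^2}{4}\exp\big[M^2\exp(Hb_2/A)\big].$$

Consequently, for any $-(2AH)^{-1} \le \lambda \le (2AH)^{-1}$, we have

$$\log\big\{E[\exp\{\lambda[\tilde\ell(Y, F) - \tilde\ell(Y, F^*)]\}]\big\} \le \lambda[R(f) - R(f^*)] + \lambda^2E[(F - F^*)^2]\,\frac{A^2}{4}\exp\big[M^2\exp(Hb_2/A)\big].$$

Now it remains to notice that $E[(F - F^*)^2] \le 2[R(f) - R(f^*)]/b_1$. Indeed, consider the function $\phi(t) = R(f^* + t(f - f^*)) - R(f^*)$, where $f \in \mathcal{F}$ and $t \in [0; 1]$. From the definition of $f^*$ and the convexity of $\mathcal{F}$, we have $\phi \ge 0$ on $[0; 1]$, implying that $\phi'(0) \ge 0$. Besides, $\phi(1) = \phi(0) + \phi'(0) + \int_0^1 (1 - t)\phi''(t)\,dt$, where $\phi''(t)$ is defined as

$$\phi''(t) = E\Big\{[f(X) - f^*(X)]^2\,\tilde\ell_Y''\big[(1 - t)f^*(X) + tf(X)\big]\Big\} \ge b_1E\{[f(X) - f^*(X)]^2\},$$

implying that

$$\frac{b_1}{2}E(F - F^*)^2 \le R(f) - R(f^*). \tag{4.8}$$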



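4.5. Proof of Lemma 3.6. We have

$$E\Big[\big\{[Y - f(X)]^2 - [Y - f^*(X)]^2\big\}^2\Big]$$
$$= E\Big([f^*(X) - f(X)]^2\big\{2[Y - f^*(X)] + [f^*(X) - f(X)]\big\}^2\Big)$$
$$= E\Big([f^*(X) - f(X)]^2\big\{4E\big([Y - f^*(X)]^2 \mid X\big) + 4E\big(Y - f^*(X) \mid X\big)[f^*(X) - f(X)] + [f^*(X) - f(X)]^2\big\}\Big)$$
$$\le E\Big([f^*(X) - f(X)]^2\big\{4\sigma^2 + 4\sigma|f^*(X) - f(X)| + [f^*(X) - f(X)]^2\big\}\Big)$$
$$\le E\big([f^*(X) - f(X)]^2(2\sigma + H)^2\big)$$
$$\le (2\sigma + H)^2[R(f) - R(f^*)],$$

where the last inequality is the usual relation between excess risk and distance using the convexity of $\mathcal{F}$ (see above (4.8) for a proof).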



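4.6. Proof of Lemma 3.7. Let $S = \{s \in \mathcal{F}_{\mathrm{lin}} : E[s(X)^2] = 1\}$. Using the triangular inequality in $L^2$, we get

$$E\Big(\big\{[Y - f(X)]^2 - [Y - f^*(X)]^2\big\}^2\Big)$$
$$= E\Big(\big\{2[f^*(X) - f(X)][Y - f^*(X)] + [f^*(X) - f(X)]^2\big\}^2\Big)$$
$$\le \Big(2\sqrt{E\{[f^*(X) - f(X)]^2[Y - f^*(X)]^2\}} + \sqrt{E\{[f^*(X) - f(X)]^4\}}\Big)^2$$
$$\le \Big[2\sqrt{E([f^*(X) - f(X)]^2)}\sqrt{\sup_{s \in S}E\big(s(X)^2[Y - f^*(X)]^2\big)} + E\big([f^*(X) - f(X)]^2\big)\sqrt{\sup_{s \in S}E[s(X)^4]}\Big]^2$$
$$\le V[R(f) - R(f^*)],$$

with

$$V = \bigg[2\sqrt{\sup_{s \in S}E\big(s(X)^2[Y - f^*(X)]^2\big)} + \sqrt{\sup_{f', f'' \in \mathcal{F}}E\big([f'(X) - f''(X)]^2\big)}\ \sqrt{\sup_{s \in S}E[s(X)^4]}\bigg]^2,$$

where the last inequality is the usual relation between excess risk and distance using the convexity of $\mathcal{F}$ (see above (4.8) for a proof).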



A. Uniformly bounded conditional variance is necessary to reach d/n rate

In this section, we show that the target (0.3) cannot be reached if we just assume that $Y$ has a finite variance and that the functions in $\mathcal{F}$ are bounded. For this purpose, the following result gives a $1/\sqrt{n}$ lower bound when $d = 2$. (Note that it is not implied by the $\sqrt{\log(1 + d/\sqrt{n})/n}$ lower bound for convex aggregation, proved in [25], and in slightly weaker forms in [18, 27], since the latter bound is shown for $d \ge \sqrt{n}$.)

For this, consider an input space $\mathcal{X}$ partitioned into two sets $\mathcal{X}_1$ and $\mathcal{X}_2$: $\mathcal{X} = \mathcal{X}_1 \cup \mathcal{X}_2$ and $\mathcal{X}_1 \cap \mathcal{X}_2 = \emptyset$. Let $\varphi_1(x) = 1_{x \in \mathcal{X}_1}$ and $\varphi_2(x) = 1_{x \in \mathcal{X}_2}$. Let $\mathcal{F} = \{\theta_1\varphi_1 + \theta_2\varphi_2;\ (\theta_1, \theta_2) \in [-1, 1]^2\}$.

Theorem A.1 For any estimator $\hat f$ and any training set size $n \ge 1$, we have

$$\sup_P\{E[R(\hat f)] - R(f^*)\} \ge \frac{1}{4\sqrt{n}}, \tag{A.1}$$

where the supremum is taken with respect to all probability distributions such that $f^{(\mathrm{reg})} \in \mathcal{F}$ and $\mathrm{Var}(Y) \le 1$.

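Proof. Let $\beta$ satisfying $0 < \beta < 1$ be some parameter to be chosen later. Let $P_\sigma$, $\sigma \in \{-, +\}$, be two probability distributions on $\mathcal{X} \times \mathbb{R}$ such that for any $\sigma \in \{-, +\}$,

$$P_\sigma(\mathcal{X}_1) = 1 - \beta,$$
$$P_\sigma(Y = 0 \mid X = x) = 1 \quad\text{for any } x \in \mathcal{X}_1,$$

and

$$P_\sigma\Big(Y = \frac{1}{\sqrt{\beta}} \,\Big|\, X = x\Big) = \frac{1 + \sigma\sqrt{\beta}}{2} = 1 - P_\sigma\Big(Y = -\frac{1}{\sqrt{\beta}} \,\Big|\, X = x\Big) \quad\text{for any } x \in \mathcal{X}_2.$$

One can easily check that for any $\sigma \in \{-, +\}$, $\mathrm{Var}_{P_\sigma}(Y) = 1 - \beta^2 \le 1$ and $f^{(\mathrm{reg})}(x) = \sigma\varphi_2(x) \in \mathcal{F}$. To prove Theorem A.1, it suffices to prove (A.1) when the supremum is taken among $P \in \{P_-, P_+\}$. This is done by applying Theorem 8.2 of [3]. Indeed, the pair $(P_-, P_+)$ forms a $(1, \beta, \beta)$-hypercube in the sense of Definition 8.2 with edge discrepancy of type I (see (8.5), (8.11) and (10.20) for $q = 2$): $d_I = 1$. We obtain

$$\sup_{P \in \{P_-, P_+\}}\{E[R(\hat f)] - R(f^*)\} \ge \beta(1 - \beta\sqrt{n}),$$

which gives the desired result by taking $\beta = 1/(2\sqrt{n})$. □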



B. Empirical risk minimization on a ball: analysis derived from the work of Birgé and Massart

We will use the following covering number upper bound [21, Lemma 1].

Lemma B.1 If $\mathcal{F}$ has a diameter upper bounded by $H$ for the $L^\infty$-norm (i.e., $\sup_{f_1, f_2 \in \mathcal{F},\, x \in \mathcal{X}} |f_1(x) - f_2(x)| \le H$), then for any $0 < \delta \le H$, there exists a set $\mathcal{F}^\# \subset \mathcal{F}$, of cardinality $|\mathcal{F}^\#| \le (3H/\delta)^d$, such that for any $f \in \mathcal{F}$ there exists $g \in \mathcal{F}^\#$ such that $\|f - g\|_\infty \le \delta$.

We apply a slightly improved version of Theorem 5 in Birgé and Massart [7]. First, for homogeneity purposes, we modify Assumption M2 by replacing the condition "$\sigma^2 \ge D/n$" by "$\sigma^2 \ge B^2D/n$" where the constant $B$ is the one appearing in (5.3) of [7]. This modifies Theorem 5 of [7] to the extent that "$\vee 1$" should be replaced with "$\vee B^2$". Our second modification is to remove the assumption that $W_i$ and $X_i$ are independent. A careful look at the proof shows that the result still holds when (5.2) is replaced by: for any $x \in \mathcal{X}$ and $m \ge 2$,

$$E_s[M^m(W_i) \mid X_i = x] \le a_mA^m, \quad\text{for all } i = 1, \dots, n.$$

We consider = F-/*(X), 7(z, /) = {y- f{x)Y, A{x,u,v) = \u{x)-v{x)\, 
and M{w) = 2{\w\ + H). From (1.7), for all m > 2, we have E{[(2(|M/| + 
H)Y^\X = x\< ^ [4M(A + P')]'". Now consider P' and r such that Assumption 
M2 of [7] holds for D = d. Inequality (5.8) for r = 1/2 of [7] imphes that 
for any v > + P^) log(2P' + B'r^/d/n), with probability at least 1 - 

r —nv 

P(/>-)) - R{f*) + r(/*) - r(/>™)) < (E{ [/>-)(X) - fixf} V v)/2 

for some large enough constant k depending on M. Now from Proposition 1 of 
[7] and Lemma B.l, one can take either P' = 6 and = \/p or B' = 3y^n/d 
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and r = 1. By using E{ [p''''"\X) - f*{X)Y} < - R{f*) (since J is 

convex and /* is the orthogonal projection of F on 3^), and r(/*) — r{f^^™^^) > 
(by definition of Z^™^'), the desired result can be derived. 

Theorem 1.5 provides a d/n rate provided that the geometrical quantity B is 
at most of order n. Inequality (3.2) of [7] allows to bracket B in terms of B = 
sup/6,pa„{^„...,^,} namely B < B < Bd.To understand better 

how this quantity behaves and to illustrate some of the presented results, let us 
give the following simple example. 

Example 1. Let Ai, . . . , Adhc a partition of X, i.e., X = W^^-^^Aj. Now con- 
sider the indicator functions ipj = 1^^, j = I, d: (pj is equal to 1 on Aj 
and zero elsewhere. Consider that X and Y are independent and that F is a 
Gaussian random variable with mean 9 and variance a^. In this situation: = 
j(reg) _ ^^^^ 0(pj_ According to Theorem 1.1, if we know an upper bound H on 

11/^'''°^ I loo = d, we have that the truncated estimator (/(°''^ A if ) V -H satisfies 

for some numerical constant k. Let us now apply Theorem C.l. Introduce pj = 

F{X e Aj) andpmin = miUjPj. We have Q = {Eipj{X)ipk{X)) .^^ = Diag(pj), 

X = 1 and II 61* II = OVd. We can take A = o- and M = 2. From Theorem C.l, 
for A = dL^/n, as soon as A < Pmm, the ridge regression estimator satisfies with 
probability at least 1 — e: 

^(/(ridge)) _ < ^^^d ^ ^^ ^^ 

for some numerical constant k. When d is large, the term {d^J^D / (j^Pmin) is felt, 
and leads to suboptimal rates. Specifically, since Pmin < ^/d, the r.h.s. of (B.l) 
is greater than /n^, which is much larger than d/n when d is much larger than 
n^^^. If Y is not Gaussian but almost surely uniformly bounded by C < +oo, then 
the randomized estimator proposed in Theorem 1.3 satisfies the nicer property: 
with probability at least 1 — e. 



R{f) - R{fL) < + c 



2,c/log(3pJ^) + log((logn)e ^] 



n 



for some numerical constant k. In this example, one can check that B = B' = 
l/Pmin where pmin = minj P(X G Aj). As long as p^ym target (0.2) 

is reached from Corollary 1.5. Otherwise, without this assumption, the rate is in $(d\log(n/d))/n$. ■

C. Ridge regression analysis from the work of Caponnetto and De Vito

From [8], one can derive the following risk bound for the ridge estimator. 

Theorem C.l Let gmin be the smallest eigenvalue of the d x d-product matrix 
Q = {Eipj{X)(pkiX)).^. Let X = sup^^^ Ej=i V'i(a^)^- Let \\e*\\ be the Eu- 
clidean norm of the vector of parameters of f^^^ = X]j=i ^j'^j- Let < e < 1/2 
and /Cg = log^(£:^^). Assume that for any x G X, 

E{exp[|r-/*„(X)|/A] \X = x]<M. 

For A = {Xd£j^)/n, if \ < gmin> the ridge regression estimator satisfies with 
probability at least 1 — e: 

^(/(ridge)) _ j^^f.^^ ^KL^r^,^ ^XL,\\e*\A (C.l) 

for some positive constant n depending only on M. 

Proof. One can check that G argmin^^j^ ^(/)+AEi=i where 
is the reproducing kernel Hilbert space associated with the kernel K : (x, x') (-> 
Ej=i Vj{x)Vk{x'). Introduce G argmin^^^^ ^(/) + A Ej=i WfWli- Let us use 
Theorem 4 in [8] and the notation defined in their Section 5.2. Let be the column 
vector of functions [(pj]j=i, Diag(aj) denote the diagonal d x d-matrix whose j- 
th element on the diagonal is aj, and Id be the d x rf-identity matrix. Let U and 
gi, . . . , grf be such that UU^ = I and Q = f/Diag(gj)t/^. We have /it„ = cp'^O* 
and /(^) = ip^iQ + Xiy^QO*, hence 

fL - f'^ = ^^t/Diag(A/(g, + X))U^e*. 

After some computations, we obtain that the residual, reconstruction error and 
effective dimension respectively satisfy yi(A) < ^— ||6'*|p, !B(A) < 

^min ^min 

and 3\f(A) < d. The result is obtained by noticing that the leading terms in (34) of 
[8] are ^(A) and the term with the effective dimension !N(A). □ 

The dependence on the sample size $n$ is correct since $1/n$ is known to be minimax optimal. The dependence on the dimension $d$ is not optimal, as observed in the example given page 36. Besides, the high probability bound (C.1) holds only for a regularization parameter $\lambda$ depending on the confidence level $\varepsilon$. So we do not have a single estimator satisfying a PAC bound for every confidence level. Finally, the dependence on the confidence level is larger than expected. It contains an unusual square. The example given page 36 illustrates Theorem C.1.

D. Some standard upper bounds on log-Laplace transforms

Lemma D.1 Let $V$ be a random variable almost surely bounded by $b \in \mathbb{R}$. Let $g : u \mapsto [\exp(u) - 1 - u]/u^2$. Then

$$\log\big\{E[\exp[V - E(V)]]\big\} \le E(V^2)\,g(b).$$

Proof. Since $g$ is an increasing function, we have $g(V) \le g(b)$. By using the inequality $\log(1 + u) \le u$, we obtain

$$\log\big\{E[\exp[V - E(V)]]\big\} = -E(V) + \log\big\{E[1 + V + V^2g(V)]\big\} \le E[V^2g(V)] \le E(V^2)\,g(b).$$

□

Lemma D.2 Let $V$ be a real-valued random variable such that $E[\exp(|V|)] \le M$ for some $M > 0$. Then we have $|E(V)| \le \log M$, and for any $-1 < \alpha < 1$,

$$\log\big\{E[\exp\{\alpha[V - E(V)]\}]\big\} \le \frac{\alpha^2M^2}{2\sqrt{\pi}(1 - |\alpha|)}.$$

Proof. First note that by Jensen's inequality, we have $|E(V)| \le \log(M)$. By using $\log(u) \le u - 1$ and Stirling's formula, for any $-1 < \alpha < 1$, we have

$$\log\big\{E[\exp\{\alpha[V - E(V)]\}]\big\} \le E[\exp\{\alpha[V - E(V)]\}] - 1$$
$$= E\big\{\exp\{\alpha[V - E(V)]\} - 1 - \alpha[V - E(V)]\big\}$$
$$\le E\big\{\exp[|\alpha||V - E(V)|] - 1 - |\alpha||V - E(V)|\big\}$$
$$\le E\big\{\exp[|V - E(V)|]\big\}\,\sup_{u \ge 0}\big\{[\exp(|\alpha|u) - 1 - |\alpha|u]\exp(-u)\big\}$$
$$\le E\big[\exp(|V| + |E(V)|)\big]\,\sup_{u \ge 0}\sum_{m \ge 2}\frac{|\alpha|^mu^m}{m!}\exp(-u)$$
$$\le M^2\sum_{m \ge 2}\frac{|\alpha|^m}{m!}\sup_{u \ge 0}u^m\exp(-u) = \alpha^2M^2\sum_{m \ge 2}\frac{|\alpha|^{m-2}}{m!}\,m^m\exp(-m)$$
$$\le \alpha^2M^2\sum_{m \ge 2}\frac{|\alpha|^{m-2}}{\sqrt{2\pi m}} \le \frac{\alpha^2M^2}{2\sqrt{\pi}(1 - |\alpha|)}.$$

□
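As a sanity check of Lemma D.2, one can compare both sides of the inequality on a simple distribution. The Python sketch below uses a Rademacher variable ($V = \pm 1$ with probability $1/2$), an example chosen here for illustration and not taken from the text. In that case $E(V) = 0$, $E[\exp(|V|)] = e$ so that $M = e$ is admissible, and the log-Laplace transform is $\log\cosh(\alpha)$. The function names are hypothetical.

```python
import math


def lemma_d2_bound(alpha, M):
    """Right-hand side of Lemma D.2: alpha^2 M^2 / (2 sqrt(pi) (1 - |alpha|))."""
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (-1, 1)")
    return alpha ** 2 * M ** 2 / (2.0 * math.sqrt(math.pi) * (1.0 - abs(alpha)))


def rademacher_log_laplace(alpha):
    """log E exp(alpha [V - E V]) for V = +1 or -1 with probability 1/2."""
    return math.log(math.cosh(alpha))


if __name__ == "__main__":
    for alpha in (-0.9, -0.5, 0.0, 0.5, 0.9):
        print(alpha, rademacher_log_laplace(alpha), lemma_d2_bound(alpha, math.e))
```

The bound is far from tight in this bounded example (the left-hand side is at most $\alpha^2/2$), which is expected since Lemma D.2 only uses an exponential moment of $|V|$.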